DOCUMENT RESUME

ED 212 650                                              TM 820 024

AUTHOR          Quellmalz, Edys; And Others
TITLE           Studies in Test Design: Annual Report.
INSTITUTION     California Univ., Los Angeles. Center for the Study
                of Evaluation.
SPONS AGENCY    National Inst. of Education (ED), Washington, D.C.
PUB DATE        Nov 81
GRANT           NIE-G-80-0112
NOTE            323p.; For related documents, see ED 211 592 and TM
                820 026.
EDRS PRICE      MF01/PC13 Plus Postage.
DESCRIPTORS     Cost Effectiveness; Criterion Referenced Tests;
                Elementary Secondary Education; Higher Education;
                Learning Processes; *Measurement Techniques;
                Pictorial Stimuli; Research Utilization; Responses;
                Scoring; Student Placement; *Test Construction;
                *Testing Problems; *Test Validity; *Writing
                Evaluation; Writing Instruction; Writing Skills
IDENTIFIERS     Inter Rater Reliability



ABSTRACT

This document contains the following manuscripts:
"Effects of Alternate Scoring Options on the Classification of
Entering Freshmen Writing Competencies," by Edys Quellmalz and Eva
Baker; "Implications of Learning Research for Designing Competency
Based Assessment," by Edys Quellmalz; "Effects of Alternative
Discourse and Response Modes on Characterizations of Students'
Writing Performance," by Frank Capell, Edys Quellmalz and Chi Ping
Chou; "Problems in Stabilizing the Judgment Process," by Edys
Quellmalz; "Effects of Visual or Written Topic Information on Essay
Quality," by Eva Baker and Edys Quellmalz; "Effects of Time and
Strategy Use on Writing Performance," by Linda Polin; "Designing
Writing Assessments: Balancing Fairness, Utility and Cost," by Edys
Quellmalz; "The Measurement of Students' Writing Performance in
Relation to Instructional History," by Marcella Pitts; "Measures of
High School Students' Expository Writing: Direct and Indirect
Strategies," by Laura Spooner-Smith; and "Alternative Scoring Systems
for Predicting Criterion Group Membership," by Lynn Winters.
(Author/BW)



***********************************************************************
*    Reproductions supplied by EDRS are the best that can be made     *
*                     from the original document.                     *
***********************************************************************



Center for the Study of Evaluation



UCLA Graduate School of Education 
Los Angeles, California 90024 



DEPARTMENT OF EDUCATION
NATIONAL INSTITUTE OF EDUCATION
EDUCATIONAL RESOURCES INFORMATION
CENTER (ERIC)

This document has been reproduced as
received from the person or organization
originating it.
Minor changes have been made to improve
reproduction quality.
Points of view or opinions stated in this document
do not necessarily represent official NIE
position or policy.






"PERMISSION TO REPRODUCE THIS
MATERIAL HAS BEEN GRANTED BY


TO THE EDUCATIONAL RESOURCES
INFORMATION CENTER (ERIC)."



Deliverable - November 1981



STUDIES IN TEST DESIGN 



Annual Report 
Edys Quellmalz, Project Director 



Grant Number 
NIE-G-80-0112 



CENTER FOR THE STUDY OF EVALUATION 
Graduate School of Education 
University of California, Los Angeles 







This document contains manuscripts prepared by
the Test Design Project for refereed publications.



Test Design: Studies in Writing Assessment

Annual Report

November 1981


Effects of Alternate Scoring Options on the Classification of
Entering Freshmen Writing Competencies
    - Edys Quellmalz and Eva Baker

Implications of Learning Research for Designing Competency
Based Assessment
    - Edys Quellmalz

Effects of Alternative Discourse and Response Modes on Characterizations
of Students' Writing Performance
    - Frank Capell, Edys Quellmalz and Chi Ping Chou

Problems in Stabilizing the Judgment Process
    - Edys Quellmalz

Effects of Visual or Written Topic Information on Essay Quality
    - Eva Baker and Edys Quellmalz

Effects of Time and Strategy Use on Writing Performance
    - Linda Polin

Designing Writing Assessments: Balancing Fairness, Utility and
Cost
    - Edys Quellmalz

The Measurement of Students' Writing Performance in Relation to
Instructional History
    - Marcella Pitts

Measures of High School Students' Expository Writing: Direct and
Indirect Strategies
    - Laura Spooner-Smith

Alternative Scoring Systems for Predicting Criterion Group
Membership
    - Lynn Winters



EFFECTS OF ALTERNATIVE SCORING OPTIONS ON THE
CLASSIFICATION OF ENTERING FRESHMEN WRITING COMPETENCIES



Edys Quellmalz and Eva Baker



Center for the Study of Evaluation
Graduate School of Education
University of California, Los Angeles



The project presented or reported herein was performed
pursuant to a grant from the National Institute of Education,
Department of Education. However, the opinions
expressed herein do not necessarily reflect the position
or policy of the National Institute of Education, and no
official endorsement by the National Institute of Education
should be inferred.






Among the many criticisms of the quality of public education, complaints
about students' inability to write prose lead the pack. At the
time of college admission, when students need to be assigned to beginning
English courses, writing deficiencies become especially salient. At entrance
to college, students may be assigned to college-level beginning
English courses or, with greater frequency, may be placed in a special
course designed to remedy composition problems and to prepare for regular
college-level work. This initial placement decision is made through different
means. Some schools base their decision solely on student verbal
scores on a college entrance examination. Others require that all students
take a special placement examination. These examinations may vary in their
development history (locally prepared or commercially published), definition
of writing (narrative or expository prose), format (multiple choice or
essay production), and manner by which the passing score is determined.
An ideal and experimentally clean way to make choices among such alternatives
would involve the systematic variation of some of these variables
to determine which procedures provide the least mistaken estimate of students'
writing ability. In fact, admission is a serious business, and little
experimental "fooling" with the system is tolerated in real colleges and
universities, even for the promised benefit of improved decisions.

This study, however, is an attempt to contrast alternative assessment
methods in actual placement testing. Its practical impetus grew from
specific requirements in the higher education system in California. As
background, California has two statewide university systems: the University
of California (UC) and the California State University and Colleges
(CSUC). Although the systems are designed to attract different levels of
students (at UC, the top 12½% statewide and at CSUC, the top 33%), students
may transfer from system to system or to different campuses within the
same system. CSUC consists of 19 campuses, and to standardize requirements
among campuses, a committee of faculty cooperated with the Educational
Testing Service (ETS) to develop a system-wide test of English composition
placement, the English Placement Test (EPT). The UC system of
nine campuses operates so that each campus' unique placement test (called
the Subject A examination) is honored by the other campuses. Since CSUC
students often wish to transfer to UC schools, a study group made up of
faculty from both systems was appointed to review the need for a common writing
placement procedure for all UC and CSUC campuses. The use of the English
Placement Test was suggested by the CSUC representatives.

The problem in its most simple form is whether the EPT would provide
the same quality of information thought to be obtained through the existing
procedures at UC campuses. Could a test designed for a population consisting
of the top one-third of students operate efficiently for the top 12½%?

Embedded in this problem are a number of serious issues related to
the teaching and testing of writing. For a start, few agree on the definition
of writing competence itself. A common but operationally vague
desire is that students ought to write well enough to succeed in other
college courses, as if success were a unidimensional phenomenon. In fact,
Smith (1975) demonstrated that requirements for success vary from college
specialization to specialization. Definitions of competence may focus
on particular features of writing, such as structural or grammatical elements.
In other views, acceptable mechanics are a minimum, but emphasis
is given, in addition, to the quality of thought or to the logic and clarity
of the communication.

A second issue running through this study is the form of student
response used to make the decision. Some tests of writing rely heavily
on "indirect" measurement, where performance on multiple choice tests is
used to "predict" writing achievement. These tests are justified along
these connected lines of argument. First, the correlation coefficients
of written essays and multiple choice tests are high enough that the
"validity" of the objective test should not be challenged. The tests are
functionally thought to measure the "same thing" (Godshalk, Swineford, &
Coffman, 1966; Breland & Brauctier, 1977). Given this equivalence, efficiency
favors choosing the least expensive method, and objective tests
are easier and cheaper to administer and score. The scoring argument is
bolstered by the well-known differences in raters' judgments of essays,
that is, the matter of scorer unreliability.

Proponents of collecting writing samples from students argue that
the cognitive requirements of creating essays and answering a series of
multiple choice tests differ markedly from one another, and that no amount
of statistical modelling can actually equate writing with choosing the
right answer (Spooner-Smith, 1978; Quellmalz & Capell, 1979). Further,
criticisms of rater unreliability are countered by the results
of good training procedures. However, the cost issue remains, cast by
these advocates as a choice between cheap, irrelevant information or more
costly, valid data.

A third issue applies to any definition or format for the assessment
of student writing competence: how are standards of passing or failing
set? Does the standard treat equally the two forms of potential misclassification,
competent students who "fail" and incompetent students who "pass"?
Is there a policy that the benefit of the doubt goes to the student? Does
the system so value its definition of writing that it wishes to be conservative
about who gets to enter college English courses?

A last but critical issue arises for those who have opted for the
collection of essay responses. Not only are the number, type,
and length of responses necessary for accurate judgment questioned, but heated
disagreement also occurs over the best scoring procedures. The choices are
between holistic scoring, which gives an overall estimate of the essay,
and analytic scoring, which provides subscores for particular characteristics
of the writing. Again, the conflict is between cost, where holistic
scoring takes approximately 2/3 the time of analytic scoring, and precision
of information, where analytic scores provide diagnosis of deficient performance.
Strong advocates for holistic scoring cite its economy (Godshalk,
et al., 1966; Alloway, 1978; Powills, Bowers & Conlan, 1979). However,
feature analyses of good and poor papers point to the distinct differences
in their content and structure (see Cooper, Cherry, Gerber, Fleischer,
Copley, & Sartisky, 1979), and advocates of analytic ratings argue for
the use of such information in determining instructional policy for remediation
(Quellmalz, 1980).

* < 

With contention as a backdrop, then, the practical problem of choosing
a "good" placement procedure for UC was studied. Staff at a university-based
research center proposed research to compare three alternative
methods for making the placement decision: the use of the English Placement
Test (EPT) (consisting of an essay and multiple choice scales) proposed
by the CSUC staff; the placement procedure (Subject A examinations)
in use at each of the two UC campuses; and an analytic essay rating scale
developed by the research center in the course of its studies of writing
(the CSE scale). Two simple questions were formulated to guide this study:

1. How comparable are the scores students receive from each
   form of writing assessment?

2. Would the methods sort students into competent and
   incompetent groups in the same way?






METHODS

Overview

Each of two UC campuses agreed to participate in the study. Instead
of requiring their own Subject A examination, each campus administered the
EPT examination to a sample of students participating in regular placement
examinations. The EPT essay was first scored by ETS, rescored at each
campus using campus scoring procedures (both campuses used holistic rating
procedures), and then the essays were sent to the research center for rerating
according to the CSE analytic scheme. Actual placement decisions
for each student were made on the basis of the campus interpretation of
ETS scores.

Subjects

Three hundred eight high school seniors were required to take the
experimental version of the placement examination at either of two UC
campuses. A placement test for writing was a regular requirement for
students scoring between 450 and 600 on the College Entrance Examination
Board (CEEB) test.

Instruments

The English Placement Test

The EPT was developed by the Educational Testing Service in collaboration
with CSUC as a placement tool for first-year English classes in the
CSUC system. The EPT requires students to write one 45-minute essay and
to complete a 90-minute multiple choice section covering three skill areas:
reading, sentence construction, and logic and organization. The reading
section asks students to identify main ideas and to interpret ideas in
short reading passages. The sentence construction test items require
students to recognize arrangements of sentence elements that "express
meaning clearly and correctly." The logic and organization section contains
a variety of item types intended to measure students' ability to
"see relationships between words." For example, some items require students
to arrange words into categories; other items involve identifying
sentences to begin, end, or support a given paragraph. Still other items
intend to measure the students' ability to distinguish between fact and
opinion. The objective part of the EPT counts 75% of the total.

Essay topic. The essay directions required students to write a 45-minute
essay on a topic eliciting narrative/descriptive writing. The
topic of this administration called for students to write about "a real
or an apparent change that had occurred in someone they knew."

EPT essay criteria. The EPT scoring scale is a six-point holistic
essay scale divided into two parts, "upper half papers" and "lower half
papers." Raters are instructed to read each paper through quickly and
assign an overall rating based on how well the essay addresses
all aspects of the question (topic), how well the essay is organized, and
how well it demonstrates writing quality. Aspects of writing quality mentioned
in the rubric are syntax and diction. Papers that do not respond
to, argue or avoid the question are scored zero. The EPT was studied for
content validity, as reported by Breland and Ragosa (1976). Unfortunately,
no results were available.




UC Campus 1: holistic essay criteria

Campus 1 employed a six-point holistic scale which permits readers
to assign a plus or minus to each point on the scale (1=high, 6=low).
The rubric directs raters' attention to the thesis statement and its development,
sentence structure, word choice, and a detailed list of "mechanics"
features. Additionally, each point on the scale corresponds
to a placement decision. For example, scores of one, two or three indicate
that the student is prepared to take a regular freshman composition
course, while a score of four through six indicates that the student should
be placed in one of a series of increasingly remedial English classes.
Campus 1 typically employs a one-hour placement examination.

Campus 2: holistic essay criteria

A six-point holistic rating scale was also employed by Campus 2 (1=low,
6=high). The rubric emphasizes fluency and mechanics, although reference
is made to the logic and organization of the writing scale. In its normal
placement examination, two one-hour essays are produced by each student
at Campus 2.

CSE analytic essay criteria

Unlike the three holistic approaches of the other rating procedures,
the CSE essay scoring provides an analytic rating of each essay (Quellmalz,
1979). The analytic rubric derived from other scales used for narrative
discourse and from texts and tests in composition and rhetoric (Pitts,
1978). The scale presents carefully explicated criteria developed for
domain-referenced narrative writing tasks. Scale criteria require reference
to observable features in an essay, unlike many rating rubrics which
include more subjective, affective judgments. The scale consists of five
subscales, each with a range of four points. Based on studies suggesting
that holistic and analytic ratings provide distinct information about student
writing, the scale calls for both holistic and analytic ratings
(Winters, 1978). The first subscale, General Impression, directs raters
to read the paper quickly first and to rate it according to their global
judgments of its quality as an example of narration. The remaining four
subscales attend to the following components of the writing: focus, organization,
support, and mechanics. The scoring rubric for the scale contains
a detailed description of essay features associated with each of the
four levels of quality within each of the subscales.

Archival student information

In addition to the three scores generated by the rescoring of the
required placement exam, Scholastic Aptitude Test (SAT) verbal scores,
College Entrance Examination Board (CEEB) scores, high school English
course grades and grade point averages were also available for students.

Procedures

Administration

Students who came to the required UC placement examination were divided,
as they arrived, into groups taking the regular or the experimental
EPT administration. Students in the study were placed in the same room
and not exposed to the usual campus procedure. The entire EPT was administered
according to the publisher's directions. This process was repeated
on each of the two UC campuses in the study.

EPT Scoring Procedure

The essays rated by the EPT procedures were graded at the same time
as a larger pool of essays from all CSUC campuses (n=6,293). Twenty-seven
raters were trained in a three and one-half hour training session
to assign scores according to the EPT rubric. Each essay was read by two
readers, and the final score assigned to an essay was the sum of the two
scores. As the EPT rubric was a six-point scale, essay scores ranged from
one to twelve. Papers with scores differing by two or more points and all
papers that received a zero score from one reader but a non-zero score
from the other reader were read by a third reader. The total essay score
in these adjudicated cases was the sum of the two most congruent scores.
EPT reported that the majority of discrepant scores occurred in the three
to five score range.
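
The two-reader rule and its adjudication step can be sketched in code. The following is only an illustration, not the scoring service's actual procedure: the function names are hypothetical, and reading "most congruent" as the pair of scores with the smallest absolute difference is an assumption, as is the handling of ties, which the report does not describe.

```python
from itertools import combinations


def needs_third_reading(first, second):
    """Return True when two readers' scores call for adjudication.

    Rule as described in the text: scores two or more points apart,
    or a zero from exactly one of the two readers.
    """
    one_zero = (first == 0) != (second == 0)
    return one_zero or abs(first - second) >= 2


def total_essay_score(first, second, third=None):
    """Sum the two readings, or the two closest readings after adjudication."""
    if not needs_third_reading(first, second):
        return first + second
    if third is None:
        raise ValueError("discrepant readings require a third score")
    # Assumption: "most congruent" means the pair with the smallest absolute
    # difference; on a tie, the first such pair in reading order is used.
    closest = min(combinations((first, second, third), 2),
                  key=lambda pair: abs(pair[0] - pair[1]))
    return sum(closest)


print(total_essay_score(4, 5))      # readings agree closely; no third reader
print(total_essay_score(2, 5, 5))   # discrepant pair resolved by a third reader
```

A different tie-breaking rule (for example, always keeping the higher pair) would change the totals for some adjudicated papers, so the choice matters if scores near a cut point are at stake.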

Rater agreement was calculated by a correlation coefficient summarizing
the amount of agreement between the first and second scores assigned
to a paper, rather than the amount of agreement between particular rater
pairs. The correlation coefficient reported for 5,756 papers was .59.
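
To make the agreement statistic concrete, a correlation between first and second readings might be computed as follows. The report does not name the coefficient used, so treating it as a Pearson product-moment correlation is an assumption; the function name is hypothetical and the scores are invented for illustration.

```python
from math import sqrt


def pearson_r(first_scores, second_scores):
    """Pearson correlation between paired first and second readings."""
    n = len(first_scores)
    if n != len(second_scores) or n < 2:
        raise ValueError("need two equal-length lists of at least two scores")
    mean_first = sum(first_scores) / n
    mean_second = sum(second_scores) / n
    # Sums of squares and cross-products of deviations from each mean.
    cross = sum((a - mean_first) * (b - mean_second)
                for a, b in zip(first_scores, second_scores))
    ss_first = sum((a - mean_first) ** 2 for a in first_scores)
    ss_second = sum((b - mean_second) ** 2 for b in second_scores)
    if ss_first == 0 or ss_second == 0:
        raise ValueError("correlation is undefined when scores do not vary")
    return cross / sqrt(ss_first * ss_second)


# Invented readings for eight papers on a six-point scale.
first_reading = [3, 4, 2, 5, 4, 3, 6, 2]
second_reading = [4, 4, 3, 5, 3, 2, 5, 3]
print(round(pearson_r(first_reading, second_reading), 2))
```

Because the pairing is by reading order rather than by rater, a coefficient of this kind describes the pool of readings as a whole and says nothing about whether any particular pair of raters agrees.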

CSE Rating Procedures

The combined set of 308 essays was rescored at the research center
using the CSE Factual Narrative Scale II. Four raters, English instructors,
were hired to read the essays. All of the raters had previous experience
in the systematic rating of student essays, and two of the four raters had
used the particular scale in previous studies.

CSE rater training procedures were similar to those employed by Spooner-Smith
(1978) and Winters (1978). Approximately four hours were devoted to
review, rating and discussion of 30 sample essays on the essay topic.
At the conclusion of the training session, rater agreement coefficients
were computed for each of the subscores and the total scale in order to
determine whether training should be continued. Alphas ranged from .86
to .92 (based on four ratings per paper), and generalizability coefficients
ranged from .59 to .87. As a result, readers reread and discussed
the pilot test papers again for the one subscale with low reliability,
focus, before reading the actual "experimental" essays. Papers were randomly
assigned to raters.
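
Coefficient alpha for a panel of raters can be illustrated by treating each rater as an "item" and each paper as a case. This is a sketch under that assumption, using the standard alpha formula; the report does not give its computation, and the function name and ratings below are hypothetical.

```python
from statistics import pvariance


def coefficient_alpha(ratings):
    """Coefficient alpha for a papers-by-raters table of scores.

    `ratings` is a list of papers; each paper is a list holding one score
    per rater, in the same rater order for every paper.
    """
    k = len(ratings[0])
    if k < 2 or any(len(paper) != k for paper in ratings):
        raise ValueError("every paper needs the same number (>= 2) of ratings")
    # Variance of each rater's scores across papers.
    rater_variances = [pvariance([paper[j] for paper in ratings])
                       for j in range(k)]
    # Variance of the summed score each paper receives.
    total_variance = pvariance([sum(paper) for paper in ratings])
    if total_variance == 0:
        raise ValueError("alpha is undefined when summed scores do not vary")
    return (k / (k - 1)) * (1 - sum(rater_variances) / total_variance)


# Invented ratings: five training papers, four raters, four-point subscale.
training_ratings = [
    [1, 1, 2, 1],
    [2, 2, 2, 3],
    [3, 3, 4, 3],
    [4, 3, 4, 4],
    [2, 1, 2, 2],
]
print(round(coefficient_alpha(training_ratings), 2))
```

Alpha computed this way reflects the reliability of the summed (or averaged) score over all four raters, which helps explain why it can run well above a coefficient describing a single rating.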

Campus 1: rating procedure

Six teaching assistants experienced in teaching basic writing rated
the Campus 1 essays returned by ETS. The Campus 1 scale, based primarily
on a tally of mechanical errors, was used to assign essay scores. Each
paper was read by one reader; raters were department teaching assistants
and were given no additional formal training.

Campus 2: rating procedure

Campus 2 papers were read by seven raters, all composition instructors.
The raters had previous experience in rating placement essays for the English
department, so only about one and a half hours were devoted to rater
training. During this session, raters read and discussed essays on topics
analogous to the EPT topic and assigned scores according to the Campus 2
writing exam scale.

Each paper was read by two raters; the final score was the sum of the
two ratings. Papers discrepant by two or more points were read by a third
reader and the discrepancy resolved in the same manner as were discrepancies
in the EPT scoring procedure. Campus 2 calculated no interrater
reliabilities.



RESULTS

Comparability of Assessment Procedures

The first section of results addresses the comparability of the
three alternative measures and includes internal analyses of each (see
Table 1). The EPT and CSE scores will be treated first because they
each provide subscales. Consider the EPT analyses. The most dramatic
findings surround the relationship of the objective EPT subscales and the
essay score (see Table 2). The three subscales correlate strongly
with one another, a fact which suggests that they may provide redundant
information. These subscales, taken individually or combined into an "objective"
composite, relate only moderately with the EPT essay score
(r ranges between .25 and .30).

Insert Tables 1 & 2 about here

The CSE scale analysis addresses the relationship of the four analytic
subscores, the total of these scores, and the General Impression "holistic"
score for each essay (see Table 3).

Insert Table 3 about here

The relatively low correlations suggest
that the particular subscales are, in fact, identifying separate skill



                                  Table 1

                       Means and Standard Deviations

                                        Campus 1               Campus 2
                          Possible    n      X     s.d.      n      X     s.d.

EPT TOTAL                    180     104  152.38   6.44     201  154.17   3.84
EPT ESSAY                     12     104    7.03   1.38     201    7.37   1.58

EPT OBJECTIVE SCALES
  Reading                    180     104  153.04   8.81     201  154.20  12.10
  Sentence construction      180     104  154.72   7.35     201  156.01  11.??
  Logic and organization     180     104  153.16   7.89     201  154.01  11.95
  Composition                180     104  152.03   6.13     201  153.09  11.53
  Total Objective Score      540     104  460.92  22.11     201  464.21  35.00

CAMPUS SCORING                       103    2.93   1.26     201    6.61   2.02

CSE SUBSCALES
  General Impression           4      69    1.52    .71     148    1.79    .78
  Focus                        4      69    1.80    .56     148    1.9?    .55
  Organization                 4      69    1.70    .63     148    2.00    .70
  Support                      4      69    1.88    .67     148    2.09    .69
  Mechanics                    4      69    1.91    .53     148    2.35    .62
  Total                       20      69    8.81   2.35     148   10.17   2.52

                                  TABLE 2

            Internal Characteristics of EPT and CSE Assessment

                          English Placement Test

                                          Sentence                          Objective
                        Essay  Reading  construction  Logic  Composition      Total

Reading                  .27
Sentence construction    .28     .68
Logic                    .25     .71        .62
Composition              .71     .70        .79        .78
Objective Total          .30     .91        .86        .88       .85
Total                    .62     .85        .81        .81       .97           .93

N = 308


                                  TABLE 3

             Center for the Study of Evaluation analytic scale

                       General
                      Impression  Focus  Organization  Support  Mechanics

Focus                    .47
Organization             .75       .47
Support                  .48       .46       .55
Mechanics                .46       .41       .32         .28
Total                    .85       .72       .83         .73       .65

N = 217



components. The correlation of .85 for the General Impression and the
total of the subscales suggests that directing one's attention to four
particular features of writing nonetheless produces values consistent
with an overall holistic view.*

The comparison between features assessed by the EPT and the CSE
indicators more directly addresses the question of assessment comparability
(see Table 4).

Insert Table 4 about here

The essay scores derived from EPT and CSE scoring suggest that only
a moderate amount of overlap exists in the scoring rubrics. The holistic
ratings of the CSE General Impression and EPT essay correlate in the
mid-ranges; however, the component skills measured by the CSE analytic
dimensions and the EPT subscales diverge dramatically. For instance,
"organization" is assessed by both EPT and CSE scores, yet the correlation
between subscales is only .12. Sentence construction on the EPT and mechanics
on the CSE subscale, apparently comparable dimensions, correlate
.29. Clearly, the format of the EPT subscale responses (objective tests)
assesses a different capacity than the CSE subscale rating of the essay.

Comparisons were also made among the EPT scores, CSE scores, and the

UC campus holistic scoring procedures. In Table 5, the first column pre- 

Insert Table 5 about here



*[illegible] the holistic score is undoubtedly contaminated by the raters'
use of the analytic rating scales, after the first paper, that is

                                  TABLE 4

            Cross-Correlations between EPT and CSE Subscales
                            Campuses Combined

                                                  CSE
                         General
EPT                     Impression  Focus  Organization  Support  Mechanics  Total

Essay                      .46       .46       .41         .42       .38      .56
Reading                    .17       .15       .14         .16       .27      .23
Sentence construction      .18       .18       .16         .15       .29      .25
Logic & organization       .14       .20       .12         .11                .21
Composition                .36       .39       .32         .31       .40      .4
Objective test             .19       .20       .16         .16       .30      .27
Total                      .39       .33       .28         .28       .39      .42




                                  TABLE 5

         Correlation of Placement Test Scores from EPT, Campus 1,
                            Campus 2, and CSE

                  EPT essay  EPT objective  Campus 1  Campus 2  CSE

EPT essay
EPT objective        .30
Campus 1             .60         .53
Campus 2             .25         .08           *
CSE                  .40         .27          .48       .12

*Campus 1 and 2 scored only their own students' essays.



sents the simplest contrasts. The EPT correlates at the .40 level with 
the CSE total. The holistic scoring procedures at the UC campuses result
in discrepant relationships (at Campus 1, r=.60, and at Campus 2,
r=.25). A low risk conclusion is that "holistic" ratings (as used at
each campus and for the EPT rating) mean different things. In any case,
inferences about the stability of these relationships are certainly weakened
by the relatively low inter-rater reliability reported for the EPT
ratings, the lack of reliability estimates for the UC effects, and the
potential for error inherent in the single rating procedure used at Campus 1.
Yet, even if these ratings were reliable, the conclusion from these
data would be that raters using different systems operationalize writing
in very different ways.

Relationship of assessment procedure and archival information

Table 6 presents descriptive statistics for archival data by campus,
and Table 7 displays the correlations among different writing assessment
methods and other writing-related archival data often used in placement
decisions. Making inferences from such spotty results is dangerous; however,
the most consistent relationships are among the College Entrance,
Scholastic Aptitude, and English Placement Tests.

Insert Tables 6 & 7 about here

While this relationship
may result from connections between underlying abilities (for instance,
comprehension ability is assessed on all three measures), one might argue
that the fact that these tests originate from the same publisher, using



                                  TABLE 6

                       Means and Standard Deviations
                       for Archival Data by Campus

                                              Campus 1
                                           N      X     s.d.

High School English Grades                61    3.68    .34
High School Grade Point Average           90    3.68    .28
College Entrance Examination Board        90     478     79
Scholastic Aptitude Test (Verbal)         90     492     87



                                  TABLE 7

            Correlations Between Alternative Placement Scores
            and Other Predictors of College English Performance

                          College     Scholastic               High School
                          Entrance     Aptitude      High         Grade
                         Examination     Test       School        Point
                            Board      (Verbal)     English      Average

EPT total score
  Campus 1                   .54         .59          .14          .20
  Campus 2                   .66         .64          .32          .31
  Combined                   .62         .62          .19          .25

CSE total essay score
  Campus 1                   .26         .25         -.04          .01
  Campus 2                   .29        -.01          .05          .23
  Combined                   .32         .21          .00          .07

Campus essay score
  Campus 1                   .22         .23          .07          .01
  Campus 2                   .50         .31          .20          .31



supposedly similar .test development technology, may be a$ plausible a 

r 

link sffiiong them. 

More disheartening, however, is the lack of relationship among writing
indices and high school English grades and grade point average. Although
range restriction definitely must be considered (all students have a 3.2
minimum grade point average to qualify for UC admission), one would still
hope that the grades of these students, drawn as they were from the middle
of the CEEB distribution (450-600 scores), might support the validity of
the measures. One gloomy view is that high school performance, as measured
by grades, does not include much writing competence. Research on the
amount of actual precollegiate writing required of students supports this
analysis (Pitts, 1978).

A related question is the amount of performance that can be inferred
to be a specific skill and the amount inferred to be general ability or
perhaps general information. The relatively higher values for the Campus 2
procedures may be explained as general ability. This explanation is
especially interesting in the light of the weak categories in the scoring
rubric, and the form of rater training. When no need exists for
identification and operational statement of criteria in order to achieve set
levels of agreement among raters, it is reasonable to infer that the
writers' general ability rather than specific writing skill is detected
by the rating.

Alternative placement decisions using three assessment models

To compare the utility of the three methods in view of different
standards for pass and fail, two analyses were performed: 1) the pass
score was set at the mean of the scores from the experimental UC
distribution; 2) the cut score was set according to present or recommended
practice. The best approach for identifying the optimal placement of such
standards would naturally depend upon developing an adequate estimate of
"future success" in college writing, and working back from it to identify
the minimum requirements for competency. In the absence of such a refined
external criterion, the alternative placement analyses shed light on the
differences in decisions made by the various assessment approaches.

Group analyses

At the group level of analysis, Table 8 displays percentages of
students who would be placed in remedial classes if cut-off scores were
1) set at the mean of the UC sample for each of the three methods or 2)
set at the recommended or regularly used standard. When the cutoff for
the EPT essay is set at the UC mean (a customary ETS procedure), 54% of

Insert Table 8 about here

UC students would be required to take remedial English. If the EPT
cut-off score were set at the average of the CSUC population, only 26% of
the UC sample would be placed into remedial English. This contrast reflects
the differences in population in the two university systems and suggests
that if the EPT essay (and its cut-off) were adopted directly from CSUC,
then the standard of writing expected at UC would drop. The CSE scale
would place 61% of UC students in the remedial course, with either the
average or a substantively set criterion score of 10.
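
The two cut-score rules compared above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not the
procedure used in the study: the function name percent_remedial and the ten
scores are hypothetical, and the only element taken from the text is the rule
that students scoring below the cut-off are placed in the remedial course.

```python
from statistics import mean

def percent_remedial(scores, cutoff):
    """Percentage of students scoring strictly below the cut-off."""
    below = sum(1 for s in scores if s < cutoff)
    return 100.0 * below / len(scores)

# Hypothetical essay scores for ten students (not data from the study).
scores = [4, 5, 6, 6, 7, 7, 8, 8, 9, 10]

# Rule 1: cut-off set at the mean of the sample itself.
at_mean = percent_remedial(scores, mean(scores))

# Rule 2: cut-off set at an externally fixed standard (assumed here to be 6).
at_fixed = percent_remedial(scores, 6)

print(at_mean, at_fixed)
```

Because the first rule ties the cut-off to the sample, it will always fail a
sizable share of any group; the second rule depends entirely on where the
outside standard happens to fall relative to the group being tested.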






TABLE 8

Percent of Students Placed in Subject A
by the Three Scoring Systems

                      When cut-off scores =          When cut-off scores =
                      UC mean                        those previously used

                                       Remedial                       Remedial
                       N    Score      English        N    Score      English

Combined campuses
  EPT essay           304   < 7.28       54          304   < 6          26
  EPT total           304   < 153.62     48          304   < 150        18
  CSE total           235   < 9.83       61          235   < 10         61

Campus 1
  Campus rubric       103   < 2.93       49          104                31
  EPT essay           103   < 7.03       63          104   < 6          34
  EPT total           103   < 152.38     35          104   < 150        20
  CSE total            71   < 8.61       51           71   < 10         79

Campus 2
  Campus rubric       201   < 6.61       40          200   < 7          40
  EPT essay           201   < 7.37       50          200   < 6          23
  EPT total           201   < 154.27     43          200   < 150        14
  CSE total           164   < 10.35      53          164   < 10         53






Contrasts in performance between the two UC campuses demonstrate
that Campus 2 apparently draws from a somewhat more proficient population
of writers than Campus 1.

Individual placement decisions

Different predictions can be made about the placement of any individual
student under the three assessment methods (see Table 9). Numbers
in the "off" diagonal represent students who would pass under one system

Insert Table 9 about here

and fail according to another (taking pairs of procedures one at a time
for each campus). For example, at Campus 1, if the pass score were set
at the CSUC mean, 30% of the students who pass the EPT essay would fail
using the regular standards of the campus, and 57% would fail using the
CSE scale. Placement discrepancies between CSE and Campus 1 procedures
are greater than between Campus 1 and the EPT decisions. Campus 2
placement decisions similarly demonstrate discrepancies, but with different
details. For instance, in comparing the CSE with Campus 2 standards, one
can see that 36% of the students would pass in one system and fail in the
other. However, the degree of difficulty (as judged by the percentages
passing and failing in either system) shows rough equivalence. Thus, in
the case of the Campus 2-CSE comparison, it is the definition of writing
competency that accounts for differences in placement rather than
"difficulty" of the measure.
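
The distinction drawn here between disagreement and difficulty can be made
concrete with a short Python sketch. The four cell counts are those legible in
the Campus 2 rubric by CSE rubric panel of Table 9; the function name
compare_placements, the labels "a" and "b", and the assignment of the two
off-diagonal cells to particular methods are assumptions made for illustration.

```python
def compare_placements(both_pass, a_pass_b_fail, a_fail_b_pass, both_fail):
    """Summarize a 2 x 2 pass/fail cross-tabulation of two placement methods."""
    n = both_pass + a_pass_b_fail + a_fail_b_pass + both_fail
    return {
        "n": n,
        # Off-diagonal cells: students placed differently by the two methods.
        "pct_disagree": round(100 * (a_pass_b_fail + a_fail_b_pass) / n),
        # Marginal failure rates: the "difficulty" of each method.
        "pct_fail_a": round(100 * (a_fail_b_pass + both_fail) / n),
        "pct_fail_b": round(100 * (a_pass_b_fail + both_fail) / n),
    }

summary = compare_placements(both_pass=47, a_pass_b_fail=29,
                             a_fail_b_pass=30, both_fail=58)
print(summary)
```

With these counts the two methods fail nearly the same share of students, yet
more than a third of the students are classified differently, which is the
pattern the paragraph above attributes to differing definitions of competency.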






TABLE 9

Comparison of Placements When Essay Cut-off Scores
Are Set at Previously Employed Standards

Campus 1

                                EPT essay rubric
                           Pass (≥7)    Fail (≤6)    Total
  Campus 1 rubric
    Pass (≤3)                 37        31 (30%)       68
    Fail (≥4)              31 (30%)         4          35
    Total                     68           35         103

                                Campus 1 rubric
                           Pass (≤3)    Fail (≥4)    Total
  CSE rubric
    Pass (≥11)                 5        10 (14%)       15
    Fail (≤10)             40 (57%)        15          55
    Total                     45           25          70

                                EPT essay rubric
                           Pass (≥7)    Fail (≤6)    Total
  CSE rubric
    Pass (≥11)                15         0 (0%)        15
    Fail (≤10)                33           23          56
    Total                     48           23          71

Campus 2

                                Campus 2 rubric
                           Pass (≥8)    Fail (≤7)    Total
  EPT essay rubric
    Pass (≥7)                 79        76 (38%)      155
    Fail (≤6)               9 (5%)         36          45
    Total                     88          112         200

                                Campus 2 rubric
                           Pass (≥8)    Fail (≤7)    Total
  CSE rubric
    Pass (≥11)                47        30 (18%)       77
    Fail (≤10)             29 (18%)        58          87
    Total                     76           88         164

                                EPT essay rubric
                           Pass (≥7)    Fail (≤6)    Total
  CSE rubric
    Pass (≥11)                70         7 (4%)        77
    Fail (≤10)             57 (35%)        29          86
    Total                    127           36         163



DISCUSSION 


The findings of the study dramatize the dilemma facing multi-site
educational systems attempting to establish uniform writing competency
testing. The question is whether newly proposed placement method B is
better than extant placement method A, and the answer is, in this case,
unfortunately, "It depends." It depends on what you are looking for and
what evidence will convince you that you have found it. This study
underscores the fact that writing is not an undifferentiated skill construct
and that different tests may measure or emphasize very different aspects of
the writing competency domain.

The questions guiding this study structured information about the
consequences of using different assessment methods: 1) Are descriptions
of student writing competence provided by the proposed placement exam
comparable to campus methods in use or to an analytic essay scoring scheme?
and 2) Do alternative placement methods result in the same placement
decisions? The answer to both of these questions is, basically, "No."

The data indicate that descriptions of a student's writing competence
derived from the three alternative measures (the EPT essay and objective
tests, the local campus rubrics, and the CSE essay scale) differ
considerably. These differences are indicated by the generally low
correlations among the placement methods and other writing-related indices,
and, most importantly, by the discrepant classification of the same student
as master or non-master. These empirical analyses suggest a need to return
to a logical and psychological analysis of the content of the three
measurement approaches as they relate to what is meant by writing competence.

The low or moderate correlations of the ratings generated by the
EPT, UC campus and CSE rubrics imply that the criteria in these scales
emphasize different essay features. A look at the content of the rubrics
confirms these differences. Even when nominally similar methods were used,
empirical differences were found. For instance, both the EPT and Campus 2
rubrics were applications of the ETS holistic scoring procedures applied
in large scale writing assessments (Conlan, 1976; Alloway, 1978; Powills,
et al., 1979). Yet the same basic approach results in clearly different
specifications and applications of criteria by different sets of raters.
These results, at minimum, challenge the stability and validity of holistic
scoring for placement and competency decisions, where it is critical that
consistent criteria be applied fairly to all students.

Our data illustrate that, contrary to folklore, competent writing
does not "surface" apart from the details of the rating scheme. The view
of writing competency reflected in any rating procedure vastly influences
what happens to students. The results of this study were presaged by
earlier work. In a study of the effects of alternative response criteria
in holistic, analytic and quantitative rating schemes, Winters (1978) also
found that the scales differentially profiled the same set of essays and
characterized students as masters or non-masters. Furthermore, she
reported that imprecisely worded criteria were refined and clarified by
raters during training, and she hypothesized that a new set of raters would
refine and apply the criteria differently.






This study suggests that the design of writing placement assessments
requires detailed and systematic consideration of a range of test
development issues. Methodology for designing domain-referenced tests (DRT)
in general (Hively, 1974; Baker, 1974; Popham, 1978, 1980) and for
domain-referenced writing assessment in particular (Quellmalz, 1978, 1980;
Baker & Quellmalz, 1979) may provide a useful approach to developing or
selecting writing assessments. Such methods begin with a detailed definition
of desired writing competencies and then require precise domain
specifications for the rhetorical features of the writing task, explicit
criteria in the rating scale, and reliable procedures for using the scale.
These specifications permit examinations of the planned placement test by
subject matter and testing experts prior to the test administration. For
example, screening of the task structure and scoring procedures in this
study might have resulted in changing the essay task from a narrative one to
an expository task more representative of the type of writing required in
college courses. Examination of the planned scoring methods might have
resulted in the calculation of interrater reliability for Campus 2 and in
the scoring of placement essays by more than one rater for Campus 1.
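
As a rough illustration of what a calculation of interrater reliability
involves, the sketch below computes two common indices for a pair of raters:
the proportion of essays given identical scores and the Pearson correlation
between the two sets of scores. The choice of these two indices, the function
names, and the six pairs of scores are assumptions for illustration; the text
does not say which index the authors had in mind.

```python
from math import sqrt

def exact_agreement(rater_a, rater_b):
    """Proportion of essays given the identical score by both raters."""
    same = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return same / len(rater_a)

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical scores from two raters on six essays (not data from the study).
rater_1 = [3, 4, 2, 5, 4, 3]
rater_2 = [3, 4, 3, 5, 3, 3]

agreement = exact_agreement(rater_1, rater_2)
reliability = pearson_r(rater_1, rater_2)
print(f"exact agreement = {agreement:.2f}, r = {reliability:.2f}")
```

Either index requires that at least some essays be scored by two raters,
which is why single-rater scoring leaves reliability unknown.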

The design of the domain of task and scoring features for a particular
placement test also can provide a blueprint for guiding development of
comparable, parallel writing tasks, rating criteria and rating procedures,
assuring the fairness of decisions from occasion to occasion and site to
site. In the ideal case, evidence should indicate that the placement test
discriminates between surviving and floundering college writers. This study
emphasizes the need for a systematic approach to selecting or developing
writing competency tests. Perhaps through domain-referenced testing
methods and continuing longitudinal research on writing assessment
problems, we can improve the confidence we place in decisions about writing
ability.






References

Alloway, J. E. Some ways of establishing criteria for assessing writing
[...] the perspective of [...] developer. Paper [...].

Baker, E. L., & Quellmalz, E. S. Results of pilot studies. Report to
the National Institute of Education, Los Angeles: UCLA Center for
the Study of Evaluation, 1979. (OB-NIE-G-78-0213)

Baker, E. L. Beyond objectives: Domain-referenced tests for evaluation
and instructional improvement. Educational Technology, 1974, 14, 10-21.

Breland, H. M., & Braucher, J. L. Measuring writing ability. Paper
presented at the annual meeting of the American Educational Research
Association, New York, 1977.

Breland, H. M., & Ragosa, D. Validating placement tests. Paper presented at
the annual meeting of the American Educational Research Association,
San Francisco, 1976.

Conlan, G. How the essay in the CEEB English test is scored. Princeton,
N.J.: Educational Testing Service, 1976.

Cooper, C., Cherry, R., Gerber, R., Fleischer, S., Copley, B., & Sartisky,
M. Writing abilities of regularly-admitted freshmen at SUNY/Buffalo.
University Learning Center and Graduate Program in English Education
and Department of English, State University of New York at Buffalo, 1979.

Godshalk, F. I., Swineford, F., & Coffman, W. E. The measurement of writing
ability. New York: College Entrance Examination Board, 1966.

Hively, W. Introduction to domain-referenced testing. Educational Technology,
1974, 14, 5-10.

Pitts, M. The relationship of classroom instructional characteristics
and writing in the descriptive/narrative mode. Report to the National
Institute of Education, Los Angeles: UCLA Center for the Study of
Evaluation, 1978. (Grant No. OB-NIE-G-78-0213)

Popham, W. J. Criterion-referenced measurement. Englewood Cliffs, N.J.:
Prentice-Hall, 1978.

Popham, W. J. Domain-referenced strategies. In R. A. Berk (Ed.),
Criterion-referenced measurement. Johns Hopkins University Press, 1980.






Powills, O. A., Bowers, R., & Conlan, G. Holistic essay scoring: An
application of the model for the evaluation of writing ability and
the measurement of growth in writing ability over time. Paper
presented at the annual meeting of the American Educational Research
Association, San Francisco, 1979.

Quellmalz, E. Assessing writing proficiency: Designing integrated
multilevel information systems. Paper presented at the annual meeting of
the National Reading Conference, San Diego, CA, 1980.

Quellmalz, E. Defining writing domains: Effects of discourse and response
mode. Interim report to the National Institute of Education, Los
Angeles: UCLA Center for the Study of Evaluation, 1979. (Grant No.
OB-NIE-G-78-0213)

Quellmalz, E. Domain-referenced specifications for writing proficiency.
Paper presented at the annual meeting of the American Educational
Research Association, Toronto, 1978.

Quellmalz, E., & Capell, F. Defining writing domains: Effects of discourse
and response mode. Report to the National Institute of Education, Los
Angeles: UCLA Center for the Study of Evaluation, 1979. (OB-NIE-G-78-0213)

Smith, L. An assessment of writing needs of undergraduates in the life
sciences and social sciences divisions at UCLA. Unpublished thesis,
University of California, Los Angeles, 1975.

Spooner-Smith, L. Investigation of writing assessment strategies. Report
to the National Institute of Education, November, 1978. (Grant No.
OB-NIE-G-78-0213 to the UCLA Center for the Study of Evaluation)

Winters, L. The effects of differing response criteria on the assessment
of writing competence. Report to the National Institute of Education,
Los Angeles: UCLA Center for the Study of Evaluation, 1978. (Grant No.
OB-NIE-G-78-0213)



IMPLICATIONS OF LEARNING RESEARCH

FOR DESIGNING COMPETENCY BASED ASSESSMENT



Edys Quellmalz 



Center for the Study of Evaluation 

Graduate School of Education
University of California, Los Angeles 




The project presented or reported herein was performed
pursuant to a grant from the National Institute of Education,
Department of Education. However, the opinions
expressed herein do not necessarily reflect the position
or policy of the National Institute of Education, and no
official endorsement by the National Institute of Education
should be inferred.



IMPLICATIONS OF LEARNING RESEARCH

FOR DESIGNING COMPETENCY BASED ASSESSMENT

One might argue that theory and methodology in learning research
should generally precede and inform theory and practice in instruction
and testing. Although such linear patterns of research and development
rarely are found in practice, findings in learning can affect test
design. This section explores the relationship of learning research and
test design in order to provide insight into current problems and to
suggest future directions for test design. The section begins by briefly
tracing the roots of domain-referenced testing in behavioral psychology
and showing how current problems in design may be attributed both to
superficial application of behaviorism and to the limits of that learning
paradigm. The section then presents implications for test design of more
recent cognitive processing learning research.

The reader should note that this review reflects the perspective that
testing should be integrated with instruction and should inform
instructional decision making. This perspective is derived largely from
learning and instruction research paradigms and asserts that both test and
instructional tasks should be formed from the same or compatible
specifications. Further, to be maximally useful, these specifications should
reflect learning research and describe precisely the content and response
limits for both testing and instructional tasks. In other words, task
variables which affect student performance must be considered and specified
in instructional and test designs.



Domain-Referenced Testing and Behavioral Psychology 

As noted in previous sections, domain-referenced testing finds its
origins in Skinner's operant theory of human behavior (Skinner, 1954).
Skinner's analysis of learning identified two salient elements in
learning: a stimulus and an overt behavioral response elicited by the
stimulus. Mental processing of stimulus information by the learner was
excluded from the learning model since internal, unobservable phenomena were
characterized as inaccessible to hypothesis testing. The methodology and
theory developed in this hypothesis testing resulted in several interrelated
principles applicable to domain-referenced testing and instruction:

1. Stimulus and response requirements must be rigorously described; 

2. Learning and criterion tasks must be replicable;

3. Learning tasks must match criterion tasks (or test tasks). 
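
One way to picture these three principles is as a written task specification
that can be compared field by field. The Python sketch below is a hypothetical
illustration only: the class TaskSpec, its four fields, and the example values
are assumptions chosen for the example and are not a specification format
proposed in the text.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class TaskSpec:
    """A rigorously described task: stimulus and response requirements."""
    discourse_mode: str      # stimulus feature, e.g. narrative or expository
    topic_source: str        # stimulus feature, e.g. picture or written prompt
    time_limit_minutes: int  # stimulus feature
    response_mode: str       # response requirement, e.g. essay

def mismatches(learning_task, criterion_task):
    """Names of the dimensions on which the two tasks differ."""
    return [f.name for f in fields(TaskSpec)
            if getattr(learning_task, f.name) != getattr(criterion_task, f.name)]

# Hypothetical instructional task and test task.
practice = TaskSpec("narrative", "picture prompt", 45, "essay")
test_task = TaskSpec("expository", "picture prompt", 45, "essay")

print(mismatches(practice, test_task))
```

Because each specification is explicit, the same task can be regenerated
(principle 2), and an empty list of mismatches is what principle 3 asks for.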

The requirements for carefully specifying task content and response
limits advocated by DRT proponents (Hively, 1974; Baker, 1974; Popham,
1980; Polin & Baker, 1979) thus can be derived directly from the
experimental paradigm of behaviorism. However, while these researchers call
for replicable, rigorous test task specification, practice has generally
not heeded the call. Most curriculum programs and commercially published
tests, for example, have been developed from vague and imprecise
specification and do not attend to content and response dimensions that
affect student performance (Anderson, 1972; Wardrop, Anderson, Hively,
Anderson, Hastings, & Muller, 1978; Royer & Cunningham, 1978; Quellmalz,
1980). Practice generally has also ignored the demand for congruent learning
and test tasks. Critiques of instructional programs and curriculum embedded
tests document mismatches between instructional objectives and tests
purporting to measure those objectives (Baker & Spooner-Smith, 1977;
Quellmalz, Snidman, & Herman, 1977). Even learning researchers apparently
have not attended very closely to requirements for task and criterion match:
Anderson's (1972) review of 130 research articles found that less than
33% gave any rationale for test selection or development, and 51% provided
no information about the relationship of content and response requirements
for test items and experimental conditions. Montague (1980) similarly
noted that reading researchers paid inadequate attention to the alignment
of independent and dependent variables. General awareness of the need
for precise test and task specification, then, has been lacking.

A second major problem in the state of the art DRT methodology has
been inattention to the cognitive processing operations required for
students to produce the observable specified response. This problem can be
traced to behaviorism's research focus upon overt, observable behaviors
and its exclusion of covert, hypothetical processes mediating those
responses, although response hierarchies were proposed (Gagne, 1977). The
new wave of cognitive psychologists considers the exclusion of cognitive
processing in a learning model an error of omission. Critics of
objectives-based technology and behaviorism repeatedly cite the need for
methodologies of learning, instruction, and testing to identify, specify, and
account for response complexity in terms of information processing which
the student must activate to engage the content of the task and produce
the requested response (Chomsky, 1957; Greeno, 1976).

The recent shift in learning research from stimulus-response
behaviorism to cognitive learning is producing findings with profound
implications for the design of compatible test and instructional tasks.
Learning has become viewed not as a passive reaction, but as a constructive
one. Researchers have resurrected notions from Gestalt psychology and
reanalyzed Bartlett's constructionist account of retention (Bartlett, 1932).
The learning act is now described as "interactive" (Solomon, 1980) or
"generative" (Wittrock, 1974). Research in perception regarding the
influence of "anticipatory schema" on what was visually perceived (Neisser,
1967) blends with a schema theory of learning (Anderson, 1977). Language
research (Chomsky, 1957) and perceptual research (Neisser, 1967; Paivio,
1971) suggest that learners extracted and abstracted generic rules,
strategies, or knowledge representations from repeated encounters with a
corpus of oral language, text or problems with recurring elements. These
hypothesized internal "frames" (Minsky, 1975) are called schemata, defined as
data structures for representing generic, stereotypic concepts stored in
memory (Rumelhart & Ortony, 1977). Schemata determine both what information
is retained or produced and what it "means." Emphasis has shifted from the
end product, a response, to the operations required for the response to
occur.
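
Since schemata are defined above as data structures, a small Python sketch
may help make the idea concrete. It is a loose, hypothetical illustration of a
frame with default slot values, not a model taken from the cited research; the
class Schema, its interpret method, and the story slots are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Schema:
    """A generic, stereotypic concept: named slots with default values."""
    name: str
    defaults: dict = field(default_factory=dict)

    def interpret(self, observed):
        """Combine observed details with defaults for any unobserved slot."""
        filled = dict(self.defaults)   # start from the stereotype
        filled.update(observed)        # observed details override defaults
        return filled

# Hypothetical story schema with stereotypic expectations.
story = Schema("story", {
    "setting": "a familiar place",
    "protagonist": "a main character",
    "problem": "some obstacle",
    "resolution": "the obstacle is overcome",
})

# A reader who meets only two explicit details still builds a full reading.
reading = story.interpret({"protagonist": "a lost dog", "setting": "a city"})
print(reading)
```

The filled-in slots stand for what the learner supplies rather than receives,
which is the sense in which a schema shapes both what is retained and what it
is taken to mean.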

The Cognitive Research Paradigm

Much of the learning research on the structure of knowledge in the domains
of math, reading, writing, oral language, and artificial intelligence is
now attempting to identify the types and sequences of processes, operations,
or routines that demarcate different levels of developing skill in these
domains. A salient feature of the paradigm is rigorous description and
manipulation of the task content, and intensive analyses of the steps or
operations leading to the response. Responding is elicited less often in
a recognition (selected response) format than in a production format.
Often research methods include extensive protocol analysis of processes
reported by the learner.

Types of Knowledge

In an attempt to categorize and form a taxonomy of learning task types,
cognitive learning researchers make some distinctions among types of
knowledge. These distinctions have important implications for analysis and
thus specification of the content and response limits of test tasks. Earlier
work (Gagne, 1977) had proposed types of content (knowledge, facts,
concepts, principles) and a corresponding response hierarchy. Cognitive
researchers propose similar content-response distinctions. Brown, Campione,
and Day (1980) write that knowledge divides into three types: strategic,
content or factual, and meta-cognitive. Hayes-Roth (1980) delineates five
types of knowledge: general cognitive skills, declarative knowledge,
procedural knowledge, motivation, and attitudes and beliefs. In their
discussion of measurement problems in reading comprehension, Royer and
Cunningham (1978) distinguish among domain skills, topical world knowledge,
and reasoning ability. Other researchers characterize information processing
as involving two components: a knowledge or content structure component
containing a network of concepts and relations, and a set of cognitive
processes for operating on the content (Anderson, 1977; Greeno, 1976; Norman
& Rumelhart, 1975; Chi & Glaser, 1980).

Common throughout these characterizations of knowledge is the
distinction between content and operations, parallel to the distinction made
in domain-referenced task specifications. However, these general typologies
begin to suggest even more dimensions that could apply to learning,
instruction, and test task specification. Most critical for the formulation
of valid tasks is an emerging awareness of, and attempt to separate,
content and operations integral to mastery of skills in a subject matter
domain from content and operations irrelevant to that domain.

Variables in Cognitive Learning Research Relevant to Test Design

With its focus on processes and the conditions influencing them,
different variables assumed prominence in cognitive learning research. The
remaining parts of this section will discuss the variables most relevant
for domain-referenced test design.

Antecedent conditions. Cognitive learning researchers renewed their
attention to perceptions, values, social, cultural, and language
experiences and to information processing strategies the learner brought to
the learning task. In particular, these foci emphasized the critical
influence of antecedent conditions on the learners' engagement in the
immediate task (Flavell, 1977). Researchers asked, what features of the task
did the learner attend to; what schemata, scripts, and plans did the learner
have as he engaged in a new task? These cognitive processing questions
expanded behaviorists' concern with "entry level skills," instructional
specialists' concern with pretesting and, by implication, test designers'
attention to item difficulty. Some researchers suggested that the influence
of antecedent conditions implied a need for "tailored tests" (Rudner, 1978)
or tests sensitive to individual differences in "world knowledge" and in
previously established content and response schema. Other research attempted
to account or provide for antecedent conditions through more refined
design of the learning task.

Because of the diverse task dimensions that cognitive research
suggests are important for eliciting performance, the research relating to
task elicitation conditions will be discussed as it pertains to separate
dimensions within the content limits and response limits of tasks.



Content Limits 

1) Context. Among the most striking findings of cognitive learning
research has been the sensitivity of performance to the "context" in which
the task is set. "Context" includes not just the physical setting, but
the social setting in which the learning (and testing) takes place.
Writing and oral language researchers criticize the decontextualized features
of most school language tasks, both instructional and assessment (Olson,
1977; Scribner & Cole, 1978; Cazden, 1974; Britton, Burgess, Martin, McLeod,
& Rosen, 1975; Florio, 1979). They cite the importance of perceived relevance
and communicative intent in "real" speech acts and writing acts as important
factors in determining learner motivation and deployment of particular
strategies. Knowledge of a real and specific purpose and audience
differently influences what content, form, and style speakers and writers
mobilize (e.g., Olson, 1977; Britton et al., 1975). To the extent that the
occasion is natural, familiar, and meaningful to the learner, the writing or
speech is more or less motivated and facilitated. When language performance
is elicited in decontextualized settings, described by Britton as typical
assessment settings, then other, perhaps irrelevant, variables intrude
into the task (Britton et al., 1975).

In addition to information about audience and function, the context
also includes a time allocation for task completion. Ability tests often
attempt to differentiate individuals through "power" or speeded testing.
Formal achievement assessment often sets arbitrary time limits for task
performance. Cognitive research on information processing related to
subprocesses in a task (e.g., decoding in reading, sentence construction
in writing) suggests that less proficient learners might require more time
for subroutines which have become automated routines for masters (Nold, 1979;
Norman & Bobrow, 1975; Stallard, 1978). Thus time limitations for
assessment tasks may not permit students at varying competency levels to
complete processes they can perform, e.g., decode and comprehend; plan,
write, edit; analyze and solve a problem. Some researchers (see Cooper, 1979)
suggest that formal, time-limited assessment should be replaced by sampling
of work completed under more "natural" classroom or life conditions.

2) Topicality. In addition to context, the influence of a second
task feature, "topicality" or "world knowledge," has become a popular issue
in cognitive learning research. World knowledge refers to networks (or
a "memory bank") of information about world phenomena learners have in
their repertoire. In the subject domains of reading, writing, oral
language, and math, world knowledge is considered a prerequisite vehicle for
exercise of comprehension, syntactic, or problem solving skills. As Royer
and Cunningham (1978) point out, a student cannot apply a main idea
identification strategy to a passage about a topic if the student has
insufficient topic knowledge to discriminate between superordinate and
subordinate pieces of information. Other research documents that lack of
topic familiarity or a "script" may result in reduced comprehension and
retention of passage content, e.g., Bransford and Johnson (1972), Anderson,
Reynolds, Schallert, and Goetz (1977).

Similarly, in written production, a student cannot begin to compose
a coherent essay without a sufficient store of facts and relations within
a topic. Topical content of test passages or of writing topics can be
differentially biased against students with particular cultural or language
experiences (Capell, 1980; Royer & Cunningham, 1978; Baker & Quellmalz,
1980). Thus topic familiarity is a critical task feature in test design. This
dimension may be provided for in test specifications by either empirically
verifying subjects' topic familiarity or "world knowledge" on contemplated
topics prior to test design (treating it therefore as an antecedent
variable), or by attempting to provide some minimum topic information through
the inclusion of pictures or graphic material. The facilitative role of
visuals in language comprehension has been reviewed extensively by Levin
and Lesgold (1978). Studies in writing performance have also attempted
to control topic information by using pictures as writing stimuli (Pitts,
1978; Crowhurst, 1980). In addition, writing studies have found
facilitative effects of pictures on writing coherence and support for lower
achieving students (Baker & Quellmalz, 1980).

3) Task type/discourse mode. Task type appears to be a third task
feature ignored in the design of test tasks. Cognitively oriented reading
and writing studies are documenting that the recurring structural features
of the separate modes of discourse, e.g., exposition, narration,
argumentation, long described by rhetorical theorists, e.g., Kinneavy (1971), are
extracted by readers and stored as schemata, or templates of regular
discourse features. These schemata become "frames" for comprehending
discourse. Research on learners' extractions of conventional discourse
structures and applications of these schemata to new text has demonstrated that
the schemata aid comprehension of narrative and expository discourse
(Bransford & Johnson, 1972; Stein, 1978; Anderson et al., 1977; Meyer,
1975).

Writing research also provides a useful vehicle for understanding
the importance of discourse mode in the general problem of test design.
Studies have shown that writers employ different linguistic structures
in different discourse modes and that writers are differentially skilled
in producing essays in the modes (Quellmalz & Capell, 1979; Baker &
Quellmalz, 1980; Praeter & Padia, 1980; Crowhurst, 1980). While
researchers conjecture whether discourse conventions or schemata can or should
be directly taught (Paris, Scardamalia, & Bereiter, 1980), the cumulative
evidence is that the differing structural features of discourse modes place
quite different task demands on students. In discussing the syntactic
shifts across discourse modes, Cooper (1979) suggests that different modes
will make a difference in product-oriented writing studies. He suggests,
for example, that writing planning might change as much as 50 percent.

Also, in math learning, researchers have found similar performance
sensitivity to task type and are exploring the unique and common operations
within and between math task types. Davis and McKnight (1979) are studying
the applications of frame or schema theory to mathematical learning. Brown
is attempting to find common errors or "bugs" students make in relation
to types of math tasks (Brown & Burton, 1978). In her discussion of
generic, meta-cognitive skills, Hays-Roth (1980) describes one component
which consists of a learner's repertoire of strategies and procedures for
major problem types.

4) Structural complexity. A fourth task feature unexplicated in
current domain specifications and highlighted in cognitive research is the
structural complexity of passages. Work in discourse analysis (Kintsch,
1974; Meyer, 1975; Davis & Nold, 1978) indicates that semantic structure
involving abstraction levels, amount of information, and number and type
of relationships among pieces of information interacts with learners'
comprehension, retention, and production of that information. Thus,
designing reading, math, or other subject-matter passages that fail to control
for semantic structure could result in performance variability due to
readers' comprehension difficulties with passage structure, rather than
performance on the subject matter concepts and skills of interest. In
addition, specifying rules for text structure would be more likely to guide
production of somewhat homogeneous item pools.

5) Linguistic complexity. Cognitive learning research suggests a
fifth content dimension, linguistic structure. The extensive literature
documenting influence of linguistic complexity on comprehension supports
the importance of linguistic control as an item difficulty or readability
factor. (See, for example, Kintsch, 1974; Loban, 1976; Quillian, 1968.)
Some reading test specifications attempt to control this through
readability formulae (Fry, 1968; Klare, 1963); most test specifications, however,
ignore the role of linguistic complexity in task design (see Polin & Baker,
1979).



Response Limits

Cognitive learning research has particular relevance for the
specification of response limits in tasks. A marked failure of
domain-referenced testing is inattention to the operations required for the student
to react to the content. State-of-the-art test specifications tend to
describe the response limits as constructed or selected, and specify the
types of criteria or numbers of alternatives. It does not take a cognitive
psychologist to realize that a greater range of different thought
processing and problem solving goes into selecting multiple choice answers for
some tasks compared to others.

1) Response mode. Learning research has long noted the performance
distinction between selected and constructed response modes (Bourne, 1966;
Skinner, 1954), but cognitive researchers are examining the operations
required by the two response modes. In fact, much cognitive research
elicits lengthy constructed responses from students during task processing,
e.g., while solving an equation, while reading, or while planning, writing
and revising, as well as for the final response (the solution, the free recall,
the essay). This pattern contrasts with the preference for multiple-choice
options in actual tests.

Learning research in reading finds different information retained on
cued (often multiple choice) vs. free recall tasks (Anderson et al., 1977).
Writing process research is applying extensive protocol analyses to reports
of writers' processes and to their productions elicited during planning,
writing, and revising tasks, e.g., Hayes and Flower (1978, 1979), and
Bereiter (1979). Writing assessment research in a domain-referenced
framework has found very different descriptions of writing performance yielded
by scores on multiple choice tests and writing samples (Spooner-Smith,
1978; Quellmalz & Capell, 1979; Spooner-Smith, Winters, Quellmalz & Baker,
1980). Math learning research finds that requiring production responses
during problem solving and at the point of solution provides more diagnostic
information about achievement (Davis, 1979; Brown & Vanlehn, 1980). In
general, these researchers seem to value constructed responses in all
subject areas as more reflective of the status of learners' skill development.
This emphasis on constructed responding suggests that classroom level
diagnostic tests, aimed at describing a student's competency status, should
reduce dependence upon the conventional multiple choice mode. If tests
can include constructed response tasks for "benchmark" stages of
developing subject matter skill, instructors would have more powerful, sensitive
diagnostic assessment devices.

2) Processing operations or routines. Cognitive learning research is
beginning to test speculations about the nature and difficulty of operations
involved in skilled performance. These studies on inferencing from text
(e.g., Thorndyke, 1975; Spiro, 1975); routines involved in planning, writing,
and revising (Shuy, 1977; Hayes & Flower, 1979); procedures for solving
math problems (Brown & Burton, 1977; Davis & McKnight, 1979; Greeno, 1978);
and operations in other subjects such as physics and chess (Simon & Simon,
1978; Chi & Glaser, 1980; Larkin, 1979) aim to identify domain-relevant
operations. Math research has identified routines for some skills (Davis
& McKnight, 1979; Brown & Vanlehn, 1980; Resnick, 1980); reading and writing
research are just beginning to identify and verify operations.



Error analyses of processing problems or "bugs" are being conducted
to shed light on problem solving routines. Results from these analyses
may not only inform test design about the routines required for different
task types, but also about the types of distractors that could be
generated for many multiple choice tasks. For example, Brown's research
identifying common "bugs" or errors would suggest types of distractors to
include on math multiple choice tests (Brown & Vanlehn, 1980). Shaughnessy's
description of common writing errors might suggest classes of distractors
for mechanics-oriented writing multiple choice tests (Shaughnessy, 1978).
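
The idea of deriving distractors from error analyses can be illustrated
with a minimal sketch. The two "buggy" subtraction procedures below are
hypothetical stand-ins chosen for illustration; they are not taken from
the cited reports, and the function and dictionary names are assumptions.
Each procedure simulates one systematic error, and any wrong answer it
produces becomes a candidate distractor for a multiple choice item. The
sketch assumes non-negative integers with the minuend at least as large
as the subtrahend.

```python
def digits(n, width):
    """Return the decimal digits of n, most significant first, zero-padded."""
    return [int(ch) for ch in str(n).zfill(width)]


def smaller_from_larger(minuend, subtrahend):
    """Hypothetical bug: in every column, subtract the smaller digit
    from the larger one, so no borrowing ever takes place."""
    width = max(len(str(minuend)), len(str(subtrahend)))
    pairs = zip(digits(minuend, width), digits(subtrahend, width))
    return int("".join(str(abs(t - b)) for t, b in pairs))


def borrow_without_decrement(minuend, subtrahend):
    """Hypothetical bug: add ten to the top digit when a borrow is
    needed, but never decrement the column to the left."""
    width = max(len(str(minuend)), len(str(subtrahend)))
    pairs = zip(digits(minuend, width), digits(subtrahend, width))
    return int("".join(str((t - b) % 10) for t, b in pairs))


# Illustrative catalogue of error procedures (names are assumptions).
BUGS = {
    "smaller_from_larger": smaller_from_larger,
    "borrow_without_decrement": borrow_without_decrement,
}


def bug_distractors(minuend, subtrahend):
    """Return the correct answer and a dict mapping each bug name to the
    distinct wrong answer it yields for this item."""
    correct = minuend - subtrahend
    distractors = {}
    for name, procedure in BUGS.items():
        answer = procedure(minuend, subtrahend)
        if answer != correct and answer not in distractors.values():
            distractors[name] = answer
    return correct, distractors


if __name__ == "__main__":
    print(bug_distractors(62, 28))
```

An item that requires no borrowing yields no distractors under either
procedure, which suggests that items would need to be chosen so that each
bug of interest leads to a distinct wrong answer.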

3) Meta-cognitive strategies. A third dimension on which cognitive
research suggests the response limits of test tasks can vary is
meta-cognitive, general reasoning skills. Cognitive researchers differ in their
views on the place of these skills in subject matter test tasks. Royer
and Cunningham (1978) propose that general reasoning skills are separate
from, and should be removed from, tests of reading comprehension. Brown,
Campione, and Day (1980), however, define meta-cognitive information as that
knowledge learners have about the state of their own knowledge base and
strategies available to face task demands. They describe meta-cognitive
strategies for approaching text comprehension which seem more specific to
reading domains than to general strategies.

Hays-Roth (1980) defines meta-cognitive skills as strategies for
knowledge acquisition, problem solving, and reasoning that are domain
independent. These involve the learner's assessment of a repertoire of
strategies suitable for a problem, selecting, scheduling, executing, and
evaluating the efficacy of those strategies. Cognitive learning research
with adults has correlated differences in meta-cognitive skills with
differences in quality of learning and problem solving performance (Chi
& Glaser, 1980; Hays-Roth, 1980). Frase (1980) also describes
differences in test performance attributable to learners' inappropriate
test taking strategies, based on optional or partial representation
of test tasks.

The research on meta-cognitive or executive strategies is clearly
still exploratory. Its potential for test design may be the identification
of learning and task engagement strategies important for students to
activate across many subject domain tasks. Identification of generalizable
strategies would permit either their assessment as antecedent variables or
provision in test task design for partialing out or controlling for their
relationship to domain-specific operations. A line of current research in
writing is concentrating upon identifying the set of meta-cognitive
strategies required for competent writing (Flower, n.d.; Cooper & Matsuhashi, 1978;
Rose, n.d.).

Summary of Implications of Behavioral and Cognitive Learning Research for 
Domain-Referenced Testing

This review of the implications of learning research for test design 
yields three major recommendations for design methodology. First, we have 
asserted that much state-of-the-art domain-referenced test development 
methodology could be vastly improved by simply improving the descriptive 
rigor of the content and response limits in domain-referenced test
specifications. The content and response dimensions of test tasks should be
described so clearly that they provide rules for replicably generating
items with homogeneous content and response components. The content and
response dimensions so explicated should be those that research suggests
make a difference in student performance (and, therefore, items'
conceptual and performance homogeneity).

Second, domain specifications should include those additional task
features that research demonstrates influence learning and performance.
These dimensions include content limits that affect examinees'
understanding of the task demands: (1) context (e.g., purpose, audience, relevance,
and time); (2) topic range (e.g., baseball games) and presentation mode
(e.g., pictures); (3) discourse mode or task types (e.g., narration,
exposition, three variable algebraic equations); (4) structure (e.g.,
inductive, deductive); and (5) linguistic complexity. Additional factors
affecting examinees' responses should be included in response limits which
should specify (1) response mode (e.g., direct, indirect) including scoring
criteria; (2) required operations and distractor or discrimination parameters.
Whether these specified task conditions do indeed differentially influence
performance can be empirically verified for a particular test.
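
The content and response limits enumerated above can be pictured as a
structured record. The following is only an illustrative sketch: the class
names, field names, and sample values are assumptions introduced here, not
a format proposed in the text. It shows one way a specification could
record each dimension explicitly so that unspecified dimensions are easy
to detect.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ContentLimits:
    """Content dimensions of a task (field names are illustrative)."""
    context: dict                # e.g., purpose, audience, relevance, time
    topic_range: list            # e.g., ["baseball games"]
    presentation_mode: str       # e.g., "pictures"
    discourse_mode: str          # e.g., "narration", "exposition"
    structure: str               # e.g., "inductive", "deductive"
    linguistic_complexity: str   # e.g., a readability band


@dataclass
class ResponseLimits:
    """Response dimensions of a task (field names are illustrative)."""
    response_mode: str                    # e.g., "direct" or "indirect"
    scoring_criteria: list
    required_operations: list
    distractor_rules: list = field(default_factory=list)


@dataclass
class DomainSpecification:
    name: str
    content: ContentLimits
    response: ResponseLimits


def unspecified(limits):
    """Return the names of dimensions left empty in a limits record."""
    return [f.name for f in fields(limits) if not getattr(limits, f.name)]


if __name__ == "__main__":
    spec = DomainSpecification(
        name="expository essay, hypothetical example",
        content=ContentLimits(
            context={"purpose": "explain", "audience": "peers", "time": "45 min"},
            topic_range=["baseball games"],
            presentation_mode="pictures",
            discourse_mode="exposition",
            structure="deductive",
            linguistic_complexity="",  # deliberately left unspecified
        ),
        response=ResponseLimits(
            response_mode="direct",
            scoring_criteria=["main idea", "supporting detail"],
            required_operations=["plan", "write", "revise"],
        ),
    )
    print(unspecified(spec.content))
    print(unspecified(spec.response))
```

In this sketch a direct writing task legitimately has no distractor rules,
so an empty field is a prompt for review rather than an error.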

Third, test design methodology should attempt to control for or tailor
tasks to the antecedent conditions emerging as important learning and
performance variables. Methods for dealing with antecedent factors include
pretesting or tailoring tasks to individuals' world knowledge, schemata,
or cultural, cognitive predispositions to engage test tasks in identifiable
ways. It may be that constructed response tasks will permit identification
of the interaction of students' entering repertoire with the task at
hand.

Generally, as cognitive learning research documents the impact of
antecedent or task conditions on student performance, test design
methodology should more promptly and judiciously attempt to account for or
control for these conditions in assessment instruments. By attending to learning
research, designers of tests (and instruction) can construct more valid,
defensible, fair, and useful methodologies and measures.






References 



Anderson, R. C. How to construct achievement tests to assess
comprehension. Review of Educational Research, 1972, 42, 140-170.

Anderson, R. C. The notion of schemata and the educational enterprise.
In R. C. Anderson, R. J. Spiro, & W. E. Montague (Eds.), Schooling
and the acquisition of knowledge. Hillsdale, NJ: Lawrence Erlbaum
Associates, 1977.

Anderson, R. C., Reynolds, R. E., Schallert, D. L., & Goetz, E. T.
Frameworks for comprehending discourse. American Educational Research
Journal, 1977, 14, 367-382.

Anderson, T. H., Wardrop, J. L., Hively, W., Muller, K. E., Anderson,
R. I., Hastings, C. N., & Frederiksen, J. Development and trial
of a model for developing domain-referenced tests of reading
comprehension. Urbana-Champaign, IL: Center for the Study of Reading,
May, 1978.

Baker, E. L. Beyond objectives: Domain-referenced tests for evaluation
and instructional improvement. Educational Technology, 1974, 14, 10-21.

Baker, E. L., & Quellmalz, E. S. Issues in eliciting writing performance:
Problems in alternative prompting strategies. Paper presented at the
annual meeting of the National Council on Measurement in Education,
Boston, 1980.

Baker, E. L., & Spooner-Smith, L. Evaluation Response: Making Judgments.
Los Angeles, CA: Center for the Study of Evaluation, Spring 1977.

Bartlett, F. C. Remembering. Cambridge, England: Cambridge University
Press, 1932.

Bereiter, C. Development in writing. In Testing, teaching, learning.
Report of a conference on research on testing. Washington, D.C.:
National Institute of Education, 1979.

Bourne, L. E., Jr. Human conceptual behavior. Boston, MA: Allyn and
Bacon, 1966.

Bransford, J. D., & Johnson, M. K. Contextual prerequisites for
understanding: Some investigations of comprehension and recall. Journal of
Verbal Learning and Verbal Behavior, 1972, 11, 717-726.



Britton, J., Burgess, T., Martin, N., McLeod, A., & Rosen, H. The
development of writing abilities. New York: MacMillan Education,
Ltd., 1975.

Brown, A. L., Campione, J. C., & Day, J. Learning to learn: On training
students to learn from texts. Paper presented at the annual meeting
of the American Educational Research Association, Boston, 1980.

Brown, J. S., & Burton, R. R. Semantic grammar: A technique for
constructing natural language interfaces to instructional systems.
Cambridge, MA: Bolt, Beranek and Newman, Inc., 1977. (ERIC
Document Reproduction Service No. ED 142 240)

Brown, J. S., & Burton, R. R. Diagnostic models for procedural bugs
in basic mathematical skills. Cognitive Science, 1978, 2, 155-192.

Brown, J. S., & Vanlehn, K. Towards a generative theory of "bugs." Palo
Alto, CA: Xerox Palo Alto Research Center, 1980.

Capell, F. Test design project: Studies in test bias, progress report.
Center for the Study of Evaluation, University of California, Los
Angeles, 1980.

Capell, F., & Quellmalz, E. Test design project: Studies in test bias.
An examination of curricular relevance and curricular sensitivity
of achievement tests in two languages. Los Angeles, CA: Center for
the Study of Evaluation, 1980. (Grant No. NIE-G-80-0112)

Cazden, C. B. Two paradoxes in the acquisition of language structure
and functions. In K. Connolly & J. S. Bruner (Eds.), The growth of
competence. New York: Academic Press, 1974.

Chi, M. T. H., & Glaser, R. The measurement of expertise: Analysis of
the development of knowledge and skill as a basis for assessing
achievement. In E. L. Baker & E. S. Quellmalz (Eds.), Educational
Testing and Evaluation: Design, Analysis, and Policy. Beverly Hills,
CA: Sage Publications, 1980.

Chomsky, N. Syntactic structures. The Hague: Mouton, 1957.

Cooper, C. R. Current studies of writing achievement and writing competence.
Paper presented at the annual meeting of the American Educational Research
Association, San Francisco, 1979.

Cooper, C. R., & Matsuhashi, A. A video time-monitored observational study:
The transcribing behavior and composing processes of a competent high
school writer. Buffalo, NY: State University of New York, 1978.



Crowhurst, M. Syntactic complexity in narration and argument at three
grade levels. Canadian Journal of Education, 1980.

Davis, B., & Nold, E. The discourse matrix. Palo Alto, CA: Stanford
University, 1978. (ERIC Document Reproduction Service No. ED 168 022)

Davis, R. B., & McKnight, C. C. Towards eliminating "black boxes": A
new look at good vs. poor mathematics students. Urbana-Champaign,
IL: University of Illinois, 1979.

Flavell, J. H. Cognitive development. Englewood Cliffs, NJ:
Prentice-Hall, 1977.

Florio, S. Learning to write in the classroom community: A case study.
Paper presented at the annual meeting of the American Educational
Research Association, San Francisco, 1979.

Flower, L. S. Good writing: Evaluating the writer's process. Pittsburgh,
PA: Carnegie Mellon University, n.d. (mimeo).

Frase, L. J. The demise of generality in measurement and research
methodology. In E. L. Baker & E. S. Quellmalz (Eds.), Educational Testing
and Evaluation: Design, Analysis, and Policy. Beverly Hills, CA:
Sage Publications, 1980.

Fry, E. A readability formula that saves time. Journal of Reading, 1968,
11, pp. 513-516; 575-578.

Gagne, R. M. The conditions of learning (1st & 3rd ed.). New York: Holt,
Rinehart and Winston, 1967, 1977.

Greeno, J. G. Cognitive objective in instruction: Theory of knowledge
for solving problems and answering questions. In D. Klahr (Ed.),
Cognition and Instruction. Hillsdale, NJ: Lawrence Erlbaum Associates,
1976.

Greeno, J. G. Nature of problem solving abilities. In W. K. Estes (Ed.),
Handbook of Learning and Cognitive Processes (Vol. 5). Hillsdale, NJ:
Lawrence Erlbaum Associates, 1978.

Hayes, J. R., & Flower, L. Protocol analysis of the writing process.
Paper presented at the annual meeting of the American Educational
Research Association, San Francisco, 1979.

Hayes, J. R., & Flower, L. Writing as problem solving. Paper presented
at the annual meeting of the American Educational Research Association,
San Francisco, 1979.

Hays-Roth, B. Cognitive skills for learning and thinking. Proposal
submitted to the National Institute of Education, June 1980.



Herman, J., & Yeh, J. Test use: A review of the issues. In E. L.
Baker & E. S. Quellmalz (Eds.), Educational Testing and Evaluation.
Beverly Hills, CA: Sage Publications, 1980.

Hively, W. Introduction to domain-referenced testing. Educational
Technology, 1974, 14, 5-10.

Kinneavy, J. R. A theory of discourse. Englewood Cliffs, NJ:
Prentice-Hall, 1971.

Kintsch, W. The representation of meaning in memory. Hillsdale, NJ:
Lawrence Erlbaum Associates, 1974.

Klare, G. R. The measurement of readability. Ames, Iowa: Iowa State
University Press, 1963.

Larkin, J. H. Skilled problem solving in physics: A hierarchical planning
approach. Journal of Structural Learning, 1979.

Levin, J. R., & Lesgold, A. M. On pictures in prose. Educational
Communication and Technology, 1978, 26, 233-243.

Loban, W. Language development: Kindergarten through grade twelve.
Urbana, IL: National Council of Teachers of English, 1976.

Meyer, B. F. The organization of prose and its effects on memory. North
Holland studies in theoretical poetics (Vol. 1). Amsterdam: North
Holland Publishing Company, 1975.

Minsky, M. A framework for representing knowledge. In P. H. Winston (Ed.),
The psychology of computer vision. New York: McGraw-Hill, 1975.

Montague, W. E. A common flaw in research design: Inconsistency between
learning and testing requirements. Paper presented at the annual meeting
of the American Educational Research Association, Boston, April 1980.

Neisser, U. Cognitive psychology. New York: Appleton-Century-Crofts, 1967.

Nold, E. W. The writing process. Unpublished manuscript. Palo Alto:
Stanford University, 1979.

Norman, D. A., & Bobrow, D. G. On data-limited and resource-limited
processes. Cognitive Psychology, 1975, 7, 44-64.

Norman, D. A., Rumelhart, D. E., & the LNR Research Group. Explorations
in cognition. San Francisco, CA: Freeman, 1975.

Olson, D. From utterance to text: The bias of language in speech and
writing. In H. Fisher & R. Diez-Guerro (Eds.), Language and Logic
in Personality and Society. New York: Academic Press, 1977.

Paivio, A. Imagery and verbal processes. New York: Holt, Rinehart &
Winston, 1971.

Paris, P., Scardamalia, M., & Bereiter, C. Discourse schemata as
knowledge and as regulators of text production. Paper presented at the
annual meeting of the American Educational Research Association, Boston,
1980.

Pitts, M. The relationship of classroom instructional characteristics
and writing in the descriptive/narrative mode. Report to the National
Institute of Education. Los Angeles, CA: Center for the Study of
Evaluation, November 1978. (Grant No. OB-NIE-G-78-0213)

Polin, L. G., & Baker, E. L. Qualitative analysis of test item attributes
for domain-referenced content validity judgments. Paper presented at
the annual meeting of the American Educational Research Association,
San Francisco, 1979.

Popham, W. J. Domain-referenced strategies. In R. A. Berk (Ed.),
Criterion-referenced measurement. Johns Hopkins University Press, 1980.

Praeter, D., & Padia, W. Effects of modes of discourse in writing
performance in grades four and six. Paper presented at the annual meeting of
the American Educational Research Association, Boston, 1980.

Quellmalz, E. S. Test design: Aligning specifications for assessment and
instruction. Paper presented at the conference Evaluation in the 80's:
Perspectives for the National Research Agenda. Los Angeles, CA: Center
for the Study of Evaluation, June 1980.

Quellmalz, E. S., & Capell, F. Defining writing domains: Effects of
discourse and response mode. Report to the National Institute of Education.
Los Angeles, CA: Center for the Study of Evaluation, November 1979.
(Grant No. OB-NIE-G-78-0213)

Quellmalz, E. S., Snidman, N., & Herman, J. Toward competency-based
reading systems. Paper presented at the annual meeting of the American
Educational Research Association, New York, 1977.

Quillian, M. R. Semantic memory. In M. Minsky (Ed.), Semantic information
processing. Cambridge, MA: MIT Press, 1968.

Resnick, L. What do we mean by meaningful learning? Invited address at
the annual meeting of the American Educational Research Association,
Boston, 1980.

Royer, J. M., & Cunningham, D. J. On the theory and measurement of reading
comprehension (Tech. Rep. No. 91). Urbana-Champaign, IL: Center for
the Study of Reading, 1978.



Rose, M. Strategies, audience, exposition and freshman's process:
A cognitive/contextual theory of instruction for college composition.
Los Angeles: University of California, Writing Research Project, n.d.

Rudner, L. M. A short and simple introduction to tailored testing.
Paper presented at the Eastern Educational Research Association annual
meeting, Williamsburg, VA, March, 1978.

Rumelhart, D. E., & Ortony, A. The representation of knowledge in memory.
In R. C. Anderson, R. J. Spiro, & W. E. Montague (Eds.), Schooling
and the Acquisition of Knowledge. New York: Lawrence Erlbaum Associates,
1977.

Scribner, S., & Cole, M. Unpackaging literacy. Social Science Information,
1978, 17, 19-49.

Shaughnessy, M. P. Errors and Expectations. New York: Oxford University
Press, 1977.

Shuy, R. W. Toward a developmental theory of writing: Tapping and knowing.
Paper presented at the National Institute of Education Conference on
Writing, Los Alamitos, CA, 1977.

Simon, D. P., & Simon, H. A. Individual differences in solving physics
problems. In R. Siegler (Ed.), Children's thinking: What develops?
Hillsdale, NJ: Lawrence Erlbaum Associates, 1978.

Skinner, B. F. The science of learning and the art of teaching. Harvard
Educational Review, 1954, 24, 86-97.

Skinner, B. F. Verbal behavior. New York: Appleton-Century-Crofts, 1957.

Solomon, D. Intergroup relations in the elementary school: The effects
of classroom environments. Paper presented at the annual meeting of the
American Educational Research Association, Boston, April 1980.

Spiro, R. J. Inferential reconstruction in memory for connected discourse
(Tech. Rep. No. 2). Urbana-Champaign, IL: Cognitive Studies in
Education, October, 1975.

Spooner-Smith, L. Investigation of writing assessment strategies. Report
to the National Institute of Education. Los Angeles, CA: Center for
the Study of Evaluation, November 1978. (Grant No. OB-NIE-G-78-0213)

Spooner-Smith, L., Winters, L., Quellmalz, E., & Baker, E. L.
Characterizations of student writing competence: An investigation of alternative
scoring systems. Paper presented at the annual meeting of the American
Educational Research Association, Boston, April 1980.



Stallard, C. An analysis of the writing behavior of good student writers.
Research in the Teaching of English, 1978, 12, 206-218.

Stein, N. L. How children understand stories: A developmental analysis
(Tech. Rep. No. 69). Urbana-Champaign, IL: Center for the Study of
Reading, 1978.

Thorndyke, P. W. Cognitive structures in human story comprehension and
memory (Tech. Rep. No. P5513). Santa Monica, CA: Rand Corp., 1975.
(ERIC Document Reproduction Service No. ED 123 587)

Wardrop, J. L., Anderson, T. H., Hively, W., Anderson, R. I., Hastings,
C. N., & Muller, E. A framework for analyzing reading test
characteristics. Urbana-Champaign, IL: University of Illinois, 1978.

Wittrock, M. C. Learning as a generative process. Educational Psychologist,
1974, 11, 87-95.






EFFECTS OF DISCOURSE AND RESPONSE MODE ON THE
MEASUREMENT OF WRITING COMPETENCE I 



Edys Quellmalz, Frank J. Capell,
and Chih-Ping Chou



Center for the Study of Evaluation 
UCLA Graduate School of Education 
Los Angeles, CA 90024 



Grant No. OB-NIE-G-80-0112



The research reported herein was supported in whole or
in part by a grant to the Center for the Study of
Evaluation from the National Institute of Education, U.S.
Department of Education. However, the opinions and
findings expressed here do not necessarily reflect the position
or policy of NIE and no official NIE endorsement should
be inferred.






EFFECTS OF DISCOURSE AND RESPONSE MODE ON THE
MEASUREMENT OF WRITING COMPETENCE I

As school district and state assessment programs attempt to test
student basic skills achievement, attention to the methodological
problems inherent in measuring writing competence increases. The complexity
of writing as a skill domain and the lack of consensus about its
components have engendered much controversy about the important
characteristics of writing tasks. There is little agreement about the type,
length or number of tasks that should be administered for a given test
form and even about whether some aspects of composition require "direct"
assessment through the elicitation of writing samples.

Two salient measurement issues involved in specifying writing task
types are the response mode (selected vs. constructed) and the discourse
mode required by the tasks. Conventionally, large scale assessments have
dealt with the response mode issue by measuring writing skills indirectly
with multiple choice tests. Support for such indirect measurement derived
from reported high correlations between objective tests and direct measures
of written production, from the erratic reliabilities accompanying
impressionistically scored essays, and from the economic and logistical demands
of collecting and scoring writing samples (Braddock, Lloyd-Jones, & Schoer,
1963; Godshalk, Swineford & Coffman, 1966; Ireland, 1977). More recently,
however, demands for writing tests with content, construct and ecological
validity have prodded reinstatement of direct, written production tasks.



Yet even when assessments collect writing samples, students usually produce
only one composition, despite the well documented fluctuation of writing
performance from one sample to the next (French, 1962; Braddock et al., 1963;
Diederich, 1974). By necessity, a single sample taps student performance on
only one type of discourse and on one topic. This limitation presents a
measurement problem since the generic methods of development in
particular forms of discourse differ substantially from one another. Salient
structural features of argument, for instance, are issues, reasons, and
conclusions, while stories include plot, character, setting, and theme.
Exposition involves main idea, supporting detail and logical development;
narrative elaborates events in chronological order, and description
portrays concrete details in spatial order. Since different purposes set for
writing tend to elicit the generic structural elements of the modes of
discourse (Kinneavy, 1971), it seems likely that the schemata or frames
activated in the writer by writing tasks varying in purpose should differ
(Anderson, 1977; Minsky, 1975).

Research on Discourse Mode Effects 

Evidence from reading and writing research supports the
distinctiveness of processes required by varying discourse modes. Reading research
suggests that different schemata are used as students attempt to comprehend
narrative and expository text (Meyer, 1975; Graesser, Hauft-Smith, Cohen
& Pyles, 1979). As in reading, writers also employ various skills and
personal resources to meet the demands of a kind of writing or mode of
discourse. For students learning to write, these different discourse
modes represent very dissimilar challenges and, furthermore, attempts
to compare student writing skills across modes of discourse constitute
a real assessment issue. In one of the most frequently cited writing
assessment studies, the performance variability in "topic" discussed
by Godshalk et al. was probably equally or more attributable to the
five different discourse modes stimulated by the assignments than to the
subject matters addressed. Veal and Tillman (1971) reported variability
in elementary students' performance on tasks specifying different
discourse aims, as did Praeter and Padia (1980).

Moreover, other writing research demonstrates that different writing
purposes stimulate writing that varies in structural complexity (Crowhurst
& Piche, 1979; Crowhurst, 1980; Perron, 1977) and that represents writing
topics quite differently (San Jose, 1973; Perron, 1977). Most importantly
for instruction and evaluation, this accumulating body of writing research
suggests that different writing purposes require dissimilar writing
strategies of varying difficulty for individual students. Cooper (1979) has
cited research indicating that sentence structures shift when discourse
mode changes and speculated that a student's planning demands for an essay
might change as much as 50%.

The implications of reading and writing studies on discourse mode
effects for writing assessment are that the mode of discourse of the
writing purpose will make a difference in writing performance. For
example, students might be more skilled at narrative writing tasks
requiring chronological development than at expository tasks requiring
logical development. Thus the profile of writing competence for a student
derived from a writing test calling for exposition may differ from
the profile of writing competence for that same student derived from a
narrative or persuasive task.

Research on Response Mode Effects

In addition to the question of skill commonality across discourse
modes or genres, the question of the response mode or measurement form
in which the writing skill should be assessed continues as a hotly
debated topic. While many claims are made for the predictive or concurrent
validity of indirect, objective writing measures (Coffman, 1971; Breland,
1977), indirect measures simply are not considered by writing researchers
to meet the more crucial standards of content or construct validity
(Braddock et al., 1963; Cooper & Odell, 1977). Learning theory and research
suggest that selected responses elicited by multiple choice tests provide
some valuable information, but are, after all, measures of processes
required in reading comprehension, not measures of actual production ability.
As such, recognition tasks are often considered, at best, behaviors en route
to constructed responses (Bourne, 1966; Skinner, 1963).

Comparisons of direct and indirect writing measures have yielded
moderate correlations between the scores resulting from the two response
modes. In one of the seminal writing assessment studies, Godshalk et al.
(1966) reported correlations from .46 to .75 between the sum of 5 essay
scores from high school students and their College Board English
Comprehension Test scores. In an attempt to validate ETS's Test of Standard
Written English (TSWE), Breland, Conlan and Ragosa (1976) found correlations
of only .42 between that mechanics-oriented test and a 20 minute essay scored
on a 4-point scale. In a subsequent study, Breland and Gaynor (1979)
reported correlations ranging from .58 to .63 between students' three
separate essay scores awarded on a 6-point scale and the TSWE, while the
correlation between the sum of the three essays and the TSWE was .76.
Similar low to moderate correlations of .43 to .67 were found in a
comparison of the American College Testing Program's English Usage
Test (also emphasizing sentence-level skills) and students' scores on
three essays. Hogan and Mishler's (1980) study of the relationship
between third and eighth grade students' Metropolitan Achievement Test
scores and one essay yielded correlations of .68 and .65, while the
correlations increased to .75 and .81 when a second essay score entered into
the calculations.

In general, the methodology of these studies has related essay scores
derived from norm-referenced holistic ratings to a total multiple choice
test score. The apparent assumption was that both sets of measures tapped
the same set of writing skills. However, content analyses of the essay
rating criteria reveal that they were often vaguely worded, but did
reference whole-text features such as thesis, coherence, support and style, as
well as sentence-level mechanical conventions. Items on the multiple choice
tests, on the other hand, often emphasized sentence-level mechanics and
required few if any text-level discriminations and, obviously, no
production responses. At issue then is not simply whether measures correlate
statistically, but whether different measures focus on the same text
features of written productions and whether they reflect the same underlying
skill constructs.

Design Requirements for Comparing Alternative Measures of the Same Construct 

To compare the information yield and psychometric quality of writing
measures involving different discourse and response modes, data are
needed that contrast the performance of a group of examinees across
equivalently specified skill domains varying only by the modes of measurement.
The specific test objectives (skill domains), stimulus dimensions,
instructions to examinees, and response criteria/characteristics need to be
matched as closely as possible across discourse and response modes; i.e., each
of the measures should present parallel content-valid procedures for assessing
the same skill or skills. Data from measures designed to be psychologically
parallel can then be examined in terms of the comparative construct validity
and reliability of the discourse and response mode variables. A test of
writing researchers' contentions that text-level writing skills such as thesis
statement, organization and support are best measured by written production
tasks would involve comparing multiple choice test subscale scores
measuring comprehension of passages for such subskills as organization,
support, and mechanics with ratings of these features in text the student
produces. This paper reports such comparisons for score profiles obtained
from analytically scored direct assessments of student writing (essay and
paragraph length writing samples) and an indirect assessment (multiple choice
questions concerning prose passages). Measures were designed to be
conceptually parallel by using the domain specifications that guided
production of directions/prompts for the writing tasks to construct the
passages employed in the multiple choice task. Similarly, the dimensions of
writing quality making up the analytical scoring rubric applied to the writing
samples determined the specific aspects of the prose passages the multiple
choice measure questioned.

The measurement issues addressed in the study concerned the
comparability of writing scores obtained from tasks varying in discourse and
response mode. The study departed from more conventional methodology in
two respects. First, the measures of writing skills in the different
response modes were specifically designed to present tasks parallel on
all dimensions but the discourse and response mode variables. Second,
the study augmented the standard correlational comparisons of the measures
with a multitrait multimethod (MTMM) analytical technique designed to
identify the factor structure underlying students' writing score profiles
derived from the alternative measures. These analyses examined the
variance of the writing scores derived from different discourse and response
modes and the comparative discriminant validity or distinctiveness of
the sets of scores across mode variations, treating the scale scores
comprising the writing profiles as "traits," and the discourse or response
modes as "methods" (Campbell & Fiske, 1957). Correlations among different
operationalizations of the same variable should arise from the influence
of a single common factor or trait (e.g., organization). Also, the
method of measurement should exert an influence on each variable, so that
variables measured by a common method will covary to a greater degree
than those measured by different methods; this covariation can be thought
of as reflecting the operation of a common method factor. This formulation
of the MTMM approach has been implemented in other empirical studies
through confirmatory factor analysis (Joreskog, 1974; Traub & Fisher, 1977;
Werts, Joreskog & Linn, 1972). Traub and Fisher (1977), for example,
compared verbal and quantitative scores derived from fill-in, right/wrong
multiple choice and partial knowledge multiple choice response formats in
exactly this fashion.

Two studies attempted to compare direct and indirect measures of
reasonably parallel text features. Spooner-Smith used domain-referenced
skill specifications to design multiple choice items analogous to essay
rating criteria. She found correlations of the multiple choice test
total score with a General Impression score of .65 and with the total of
analytic ratings of .61. Relationships between analogous features such
as Organization and Support, however, were much lower, ranging from .23
to .55. Her findings suggested that when multiple choice scores and essay
scores are derived from precisely matched definitions of text features, the
comparability of scores on these component writing skills might be even
lower than those previously reported.

In this study we asked 1) whether student writing performance profiles
are comparable for tasks differing in discourse mode (writing purpose), and
2) whether tasks requiring different response modes (paragraphs, essays,
and multiple choice items) provide the same type and quality of information
about student writing competence. In the MTMM framework, we examined
whether distinctive common factors underlay the corresponding variables
from the writing profiles derived from the discourse and response mode
variations.



Method 

To examine the relationship of writing scores yielded by tasks
differing on the two variables, discourse and response mode requirements,
high school students received writing tests on three separate occasions.
Each student received a multiple choice test and a paragraph writing task
as well as two full length essay assignments. Ratings of the essays and
paragraph on an analytic scale and scores on the objective test provided
the bases for the comparisons.

Sample 

Approximately two hundred eleventh and twelfth grade students from
three high schools in a small school district in the Los Angeles area
participated in the study. Students were selected who were attending English
or composition classes that were judged by teachers to contain average or
above average pupils. Scores from the verbal portion of the Differential
Aptitude Test were available for 92 students in the sample; the mean
percentile score for this subsample was 63.9 (s.d. = 28.6).

Design 

Students within each class were randomly assigned to one of four
testing conditions defined by the relationship of the discourse mode(s) of
the essay tasks. In Conditions 1 and 2 (Same Genre), the three constructed
response writing tasks (two essays and one paragraph) were in the same
discourse mode. Condition 1 students wrote two expository essays and an
expository paragraph; Condition 2 students wrote two narrative essays and
a narrative paragraph.

In Conditions 3 and 4 (Different Genre), students wrote one narrative
and one expository essay. Condition 3 students wrote an expository essay
on Topic A and a narrative essay on Topic B, while Condition 4 students
wrote an expository essay on Topic B and a narrative essay on Topic A.
Half of the subjects in Conditions 3 and 4 wrote an expository paragraph,
while half wrote a narrative paragraph.

Response mode, the second factor, was a within-subject factor and
consisted of the multiple choice test (selected response), the paragraph (short
constructed response) and the essay (long constructed response). During the
three testing occasions subjects received the multiple choice test and
paragraph on one occasion, and an essay on each of the other two occasions. The
design counterbalanced the order in which students received the tasks.

Measures

The essay and paragraph tasks were constructed in accordance with a
set of domain specifications for expository and narrative writing. These
specifications included the purpose of the writing assignment, guidelines
for appropriate topics, the response criteria by which written products
were to be judged, and guidelines for the content and format of the directions
for the tasks. The response criteria were chosen to reflect the discourse
features of an analytic scoring system developed at UCLA (Pitts, 1978;
Spooner-Smith, 1978; Winters, 1978; Quellmalz, 1979). The version of
the scoring system used in this study generated five ratings for each
written product:

(1) General Impression: A global judgment of writing quality
    assigned by raters after a quick initial reading of the writing
    sample.

(2) Focus: The extent to which the subject and main idea of the
    writing sample were clearly stated or implied.

(3) Organization: The extent to which the main idea was developed
    according to a discernible method of organization (e.g., clear
    chronological or logical development).

(4) Support: The extent to which generalizations and assertions
    were supported by specific, relevant, subordinate statements.

(5) Mechanics: The extent to which the writing sample was free
    from intrusive sentence-level mechanical errors (i.e., usage,
    sentence construction, spelling, capitalization and punctuation).

Each essay and paragraph was assigned ratings on these five subscales by
one of two pairs of trained raters. The median generalizability coefficients
for the two rater pairs were .61 and .83 across topics/occasions and
subscales. The three writing samples representing direct measurement (two
essays and one paragraph) generated 15 subscale scores, each on a one (low)
to four (high) scale. The scores were calculated by averaging the scores
assigned by both raters to each essay and paragraph for each subscale.

The stimulus attributes from the specifications for the writing tasks
were used to develop the passages to be read in the multiple choice task.
Ten passages were constructed, five expository and five narrative. For each
passage, there were three questions, designed to be analogous to text features
included in the rating scales: main idea (focus), organization, and support.
Main idea questions were referenced to a stated generalization near the
beginning or end of the passage. Organization questions required the selection
of a new sentence that would best fit at a point in the passage marked by an
arrow. Support questions asked which new sentence would best support the main
idea of the passage.



Results 

Discourse Mode Effects

The first set of analyses compared students' scores according to the
discourse mode of the task.

Table 1 presents means and standard deviations of essay ratings for
each of the four test conditions.



Insert Table 1 here 



On all five subscales and on total essay scores, narrative ratings
were lower than expository ratings. This finding may be due to the
differential curricular emphasis given to narrative and expository writing
in the high schools, to subjects' lack of knowledge, at the level of personal
experience, of the information required to deal with the narrative topics,
or to raters' tendency to score narratives more stringently.

Table 2 displays the correlations between students' two essay scores
on each of the analytic scale subscales. As expected, correlations
between essay scores of students writing two essays in the same discourse
mode (Same Genre) are higher than those of students in the Different Genre
conditions.



Insert Table 2 here 



An examination of individual subscales across conditions suggests
that General Impression and Organization seem to differentiate most
between the different discourse modes. This finding might also be expected,



Table 1

Means and Standard Deviations of Essay Scores

                               Same Genre                  Different Genre
Test Condition               1               2               3               4
Topic                    A       B       A       B       A       B       A       B

General Impression  X   2.20    2.01    1.27     .88   1.4[?]    [?]     .94    2.09
                    sd   .60     .70    1.44    1.09    1.89     .66     .88     .54
Focus               X    [?]     [?]     [?]     [?]    2.28    2.19    2.26    2.33
                    sd   .69     .64     .68     .64     .59     .69     .51     .53
Organization        X   2.16    1.98    1.88    1.62    1.88    1.95    1.52    2.08
                    sd   .72     .70     .98     .71     .85     .70     .56     .56
Support             X   2.34    2.38    2.42    2.19    2.36    2.28    2.10    2.26
                    sd   .64     .54     .81     .66     .70     .65     .54     .53
Mechanics           X   2.35    2.31    2.55    2.42    2.28    2.03    2.35    2.42
                    sd   .64     .68     .69     .67     .80     .81     .56     .54
Total               X  11.50   10.90   10.35    9.32   10.23   10.31    9.18   11.19
                    sd  2.78    2.67    3.84    2.89    3.36    3.02    2.02    2.05
n =                       40      40      39      39      40      40      54      54

[?] = entry illegible in the source copy.




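Table 2

Correlation Between Students' Two Essays

                            Same Genre                 Different Genre
                      Condition 1  Condition 2    Condition 3  Condition 4

General Impression        .56         [?]             [?]         -.08
Focus                     .43         [?]             [?]          .37
Organization              .42         [?]             [?]         -.10
Support                   .27         [?]             [?]          .38
Mechanics                 .58         [?]             [?]          .50
Total                     .60         [?]             [?]          .22

n =                        40          39              40           54

[?] = entry illegible in the source copy.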



since General Impression requires a judgment about the global quality of the
essay as an example of exposition or narration. Therefore the
constellation of essay factors influencing this judgment should be the most
comprehensive for each discourse mode and thus the most discriminating.
Structurally, exposition and narration differ dramatically in their
characteristic use of logical or temporal organizations, respectively. On the
Mechanics subscale, correlations across conditions are most comparable,
reinforcing the notion that the constellation of syntactic, punctuation,
spelling and usage skills may not vary between modes of discourse so much as
text-level skills do.

To test the statistical difference between students' scores received
in the same and different discourse mode conditions, the correlations in
Table 2 were transformed to standardized scores, average standardized scores
were calculated for the same and for different genre conditions, and these
average standardized scores for same and different conditions were contrasted.
Table 3 presents the results of this analysis.

Insert Table 3 here 

The comparison reveals that the relationship between students' two
essay scores on General Impression, Organization, and the Total is
significantly stronger when students write in the same genre than when they
write in different genres.
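
The contrast reported in Table 3 can be reproduced with a short calculation.
The sketch below assumes that the "standardized scores" are Fisher r-to-z
transforms and that the test statistic divides the difference between the two
average z values by sqrt(1/(n1 - 3) + 1/(n2 - 3)), with n pooled over the two
conditions in each group. The report does not state the formula; this
assumption is used because it reproduces the tabled values. The function names
are illustrative only.

```python
import math

def fisher_z(r):
    """Fisher r-to-z transform of a correlation coefficient."""
    return math.atanh(r)

def z_contrast(z_same, n_same, z_diff, n_diff):
    """Difference between two average z values over its assumed standard error."""
    se = math.sqrt(1.0 / (n_same - 3) + 1.0 / (n_diff - 3))
    return (z_same - z_diff) / se

# General Impression row of Table 3 (z values for Conditions 1 to 4).
z_same = (0.633 + 0.412) / 2      # Same Genre average, 0.5225
z_diff = (0.343 + (-0.08)) / 2    # Different Genre average, 0.1315
gi_stat = z_contrast(z_same, 40 + 39, z_diff, 40 + 54)

print(round(fisher_z(0.56), 3))   # Condition 1 correlation in Table 2 -> 0.633
print(round(gi_stat, 3))          # -> 2.516, the tabled "z observed"
```

Under the same assumption the Organization and Total rows give values close
to the tabled 2.551 and 2.095.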

Comparisons between discourse modes on the paragraphs were conducted
only at the group level, since an individual student wrote only one
paragraph. Table 4 presents the results of these comparisons. The analytic



Table 3

Comparison Between Essay Scores
Same and Different Genre Testing Conditions

                           Same Genre                  Different Genre
                    Cond. 1  Cond. 2   z ave.     Cond. 3  Cond. 4   z ave.       z
                       z        z       1,2          z        z       3,4     observed

General Impression   .633     .412     .5225       .343    -.08      .1315     2.516*
Focus                .460     .829     .6445       .436     .388     .412      1.496
Organization         .448     .448     .448        .203    -.100     .0515     2.551*
Support              .277     .288     .2825       .070     .400     .235       .306
Mechanics            .662     .829     .7455       .741     .549     .645       .647
Total                .693     .618     .6555       .436     .224     .33       2.095*

n =                   40       39       79          40       54       94

*p<.05



scales for narration and exposition were used for rating the paragraphs.

Insert Table 4 about here

Subscale scores ranged from 1-4, total scores from 1-20. Ratings of
narrative paragraphs were generally lower than ratings of expository
paragraphs. Ratings of expository paragraphs differed significantly from
narrative paragraphs on the General Impression, Focus, and Organization
subscales and on the Total scores, suggesting that as a population, these
high school students wrote more skillfully in the expository mode.
Consonant with essay data, Mechanics and Support were not as influenced by
the different discourse tasks.
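
The group comparisons in Table 4 can be checked from the summary statistics
alone. The sketch below assumes an independent-samples t statistic with
unpooled variances (Welch's form); the report does not say which form was
used, and the function name is illustrative. It is applied to the Total Score
row (expository: mean 10.50, s.d. 2.66, n = 111; narrative: mean 9.15,
s.d. 2.93, n = 89), for which Table 4 reports t = 3.37.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """t statistic for two independent groups without pooling the variances."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / se

# Total Score row of Table 4: expository versus narrative paragraphs.
t_total = welch_t(10.50, 2.66, 111, 9.15, 2.93, 89)
print(round(t_total, 2))
```

Because the tabled means and standard deviations are rounded to two decimals,
statistics recomputed this way for the other rows agree with the table only
approximately.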

Multiple choice test comparisons of interest were the scores each
individual received on the narrative and expository sections of the exam.
Table 5 presents the means and standard deviations.

Insert Table 5 about here

On this reading comprehension test of items measuring recognition of
writing-related skills, students were able to answer Focus/Main Idea and
Support questions similarly well for both expository and narrative passages.
On Organization questions, however, students had more difficulty in general
(73% overall average), particularly with narrative organization (66%).

Table 6 displays the correlations between individuals' expository and
narrative scores on the three multiple choice subscales.



Table 4

Differences Between Scores on
Expository and Narrative Paragraphs

                             Expository     Narrative
                             Paragraph      Paragraph          t

General Impression   X          1.93           1.05
                     s.d.        .71           1.06         6.74***
Focus                X          2.34           2.09
                     s.d.        .54           [?]          2.93**
Organization         X          1.95           1.70
                     s.d.        .62            .78         2.41**
Support              X          1.94           2.02
                     s.d.        .67            .71          .87
Mechanics            X          2.35           2.29
                     s.d.        .65            .74          .65
Total Score          X         10.50           9.15
                     s.d.       2.66           2.93         3.37**

n =                             111             89

[?] = entry illegible in the source copy.



Table 5

Means and Standard Deviations of Multiple-Choice Test

                        Expository      Narrative       Total

Focus          X        4.61 (92%)      4.50 (90%)      9.13 (91%)
               s.d.      .77             .89            1.38
Organization   X        4.05 (81%)      3.31 (65%)      7.39 (73%)
               s.d.     1.03            1.13            2.03
Support        X        4.56 (91%)      4.41 (88%)      8.97 (90%)
               s.d.      .77             .98            1.51
Total          X       13.23 (88%)     12.23 (82%)     25.33 (84%)
               s.d.     2.02            2.43            4.26

n = 241

Insert Table 6 here

While all correlations are statistically significant, the correlations
between the corresponding narrative and expository subscales are
substantially lower (r = .47 to .49) than the correlation between the total
expository and narrative scores (r = .65).

In the aggregate, the preceding analyses contrasting students'
performance on writing tasks differing in discourse mode suggest that:
1) students' writing skills vary in the different discourse modes, and
2) discourse mode score variability seems to be differentially distributed
across writing subskills. These results occurred in separate analyses of
students' essays, paragraphs and multiple choice scores.

To test the effect of discourse mode on subskills, the data were then
subjected to multitrait multimethod (MTMM) analyses using confirmatory
factor analysis techniques (Joreskog, 1974). Traits were defined as the
writing subskills; methods were defined as exposition and narration. The
analyses required within-subject measures of trait and discourse mode;
therefore data from 94 subjects in test conditions 3 and 4 were used to
examine the effects of discourse mode. In these conditions students
wrote an expository and a narrative essay and answered multiple choice
questions about expository and narrative passages. Since students wrote
only one paragraph, paragraph scores were not included in the analyses.

To examine the factor structure of discourse modes in essay
performance, the five analytic subscales of General Impression, Focus,
Organization, Support and Mechanics formed the trait dimension. These five
trait scores were composed by averaging scores over raters and standardizing
across


Table 6

Correlations Between Expository and
Narrative Multiple Choice Scores

                                    Expository Subscales
Narrative                                              Expository    Test
Subscales          Focus   Organization   Support        Total       Total

Focus               .47        .32          .41           .50         .64
Organization        .30        .48          .33           .49         .73
Support             .43        .35          .49           .53         .64
Narrative Total     .50        .52          .51           .65         .88
Test Total          .64        .73          .63           .86

n = 241




topics. Thus ten scores were constructed, two each (expository and
narrative) for each of the five subscales.

To examine the factor structure of discourse modes across response
modes, a second set of analyses used just the three subscales (traits)
common to the essay and multiple choice tests: Focus, Organization, and
Support. For the multiple choice test, number correct scores were formed
within discourse mode for each of the three multiple choice subscales.
A trait score for each discourse mode was then formed by standardizing
scores across the two response modes (essay and multiple choice). The
standardization was aimed at removing interactions between response mode
and genre. This second set of analyses employed six scores, two each
(expository and narrative) for Focus, Organization and Support. Table 7
describes the variables and their abbreviations.



Insert Table 7 about here 



All analyses were based on correlation matrices computed for the
variables in Table 7. Maximum likelihood estimates of the parameters of
the MTMM confirmatory factor analysis models were obtained from the LISREL
computer program for the analysis of covariance structures (Joreskog, 1973,
1977; Joreskog & Sorbom, 1978). The LISREL program allows the analyst to
treat model parameters (e.g., factor loadings or factor intercorrelations)
in one of three ways: (a) as free parameters to be estimated by the
program; (b) as fixed parameters specified in advance to equal some fixed
number (usually zero); or (c) as constrained parameters to be estimated
by the program subject to the constraint that they equal other estimated



TABLE 7

Description of Variables in LISREL Analyses
of Discourse Mode Effects

Analyses For Essay Data

                              Method Variables
Trait Variables            Expository    Narrative

General Impression            GIE           GIN
Focus                         FE            FN
Organization                  OE            ON
Support                       SE            SN
Mechanics                     ME            MN

Analyses Pooling Essay and Multiple Choice Data

                              Method Variables
Trait Variables            Expository    Narrative

Focus                         FE            FN
Organization                  OE            ON
Support                       SE            SN



parameters. In addition, the program computes standard errors for all
free and constrained parameters, as well as an overall chi square test
of the model's fit to the data. All model equations (in LISREL notation)
are of the form:

    Y = \Lambda_y \zeta + \varepsilon                               (1)

    \Sigma = \Lambda_y \Psi \Lambda_y' + \Theta_\varepsilon         (2)

Each observed score in Y depends on the latent variables \zeta and
\varepsilon, common factors and measurement errors, respectively. Equation (2)
shows the hypothesized structure underlying the covariance matrix of the Y's,
\Sigma; it consists of a matrix of factor loadings \Lambda_y (hereafter denoted
Lambda), the covariance matrix of the \zeta's, \Psi (hereafter Psi), and a
(usually) diagonal matrix of error variances, \Theta_\varepsilon. Interest in
the analyses to follow focuses on the contents of Lambda and Psi.
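
The covariance structure described above can be made concrete with a small
numerical sketch. The example builds a toy version of the model with two
traits (Focus, Organization) and two methods (narrative, expository). All
loadings and the trait intercorrelation are hypothetical values chosen for
illustration, not estimates from this study, and the helper functions are
illustrative. It shows how a model-implied correlation between two scores
decomposes into a trait component and a method component.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(row) for row in zip(*a)]

# Rows: FN, FE, ON, OE. Columns: Focus, Organization, Narrative, Expository.
# Trait loadings are equal within a trait, as in the constrained model.
lam = [
    [0.6, 0.0, 0.5, 0.0],   # FN
    [0.6, 0.0, 0.0, 0.4],   # FE
    [0.0, 0.5, 0.6, 0.0],   # ON
    [0.0, 0.5, 0.0, 0.5],   # OE
]

# Psi: traits intercorrelated (0.5, hypothetical); methods independent.
psi = [
    [1.0, 0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

common = matmul(matmul(lam, psi), transpose(lam))

# Diagonal error variances chosen so that each implied variance equals 1.
theta = [1.0 - common[i][i] for i in range(4)]
sigma = [[common[i][j] + (theta[i] if i == j else 0.0) for j in range(4)]
         for i in range(4)]

print(round(sigma[0][1], 3))  # FN with FE: same trait, different method
print(round(sigma[0][2], 3))  # FN with ON: different trait, same method
print(round(sigma[0][3], 3))  # FN with OE: different trait and method
```

Under these assumed values the same-trait, different-method correlation (.36)
is lower than the different-trait, same-method correlation (.45), the pattern
that a strong method factor produces.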

The first set of analyses examined the influence of discourse mode
on writing trait scores derived from essays. The correlation matrix of
the transformed scores appears in Table 8. Figure 1 displays a path
diagram for a model specifying the five trait subscale factors and the
two discourse mode (method) factors.

Insert Table 8 here

Insert Figure 1 here

Each test score was assumed to be affected by a discourse mode, either
narrative or expository, and a writing subskill. Since writing skill might
be affected by other factors like I.Q. or reading ability, the five traits



TABLE 8

Correlation Matrix of Scores Containing Discourse Mode and Subscale Trait Effects

        GIN     GIE     FN      FE      ON      OE      SN      SE      MN      ME

GIN    1.000
GIE     .380   1.000
FN      .608    .308   1.000
FE      .306    .523    .327   1.000
ON      [?]     .174    [?]     [?]    1.000
OE      .087    .502    .108    .456    .007   1.000
SN      .609    .301    .474    .060    .600    .116   1.000
SE      .230    .571    .066    .461   -.016    .478    .087   1.000
MN      .411    .482    .351    .222    .286    .125    .322    .297   1.000
ME      .198    .642    .083    .386    .015    .354    .109    .410    .545   1.000

N = narrative          GI = General Impression
E = expository         F  = Focus
                       O  = Organization
                       S  = Support
                       M  = Mechanics

[?] = entry illegible in the source copy.



or writing skills were assumed to be distinctive, but interrelated, while
the two discourse modes were assumed to be independent. With the effect
of each trait constrained to be the same on the two discourse mode test
scores, the estimates of the parameters and their standard errors (in
parentheses) are summarized in Table 9.
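
Independently of the LISREL estimates, the correlations in Table 8 can be
summarized in the manner of Campbell and Fiske by averaging three kinds of
entries: same trait across modes, different traits within a mode, and
different traits across modes. The sketch below is a descriptive aid and not
part of the reported analysis. It uses the Table 8 values for General
Impression, Focus, Support and Mechanics and omits Organization, whose
narrative row is only partly legible in the source. The helper name is
illustrative.

```python
# Lower-triangle entries of Table 8 for four traits (Organization omitted).
# Variable names: trait code followed by N (narrative) or E (expository).
corr = {
    ("GIN", "GIE"): .380,
    ("GIN", "FN"): .608, ("GIE", "FN"): .308,
    ("GIN", "FE"): .306, ("GIE", "FE"): .523, ("FN", "FE"): .327,
    ("GIN", "SN"): .609, ("GIE", "SN"): .301, ("FN", "SN"): .474,
    ("FE", "SN"): .060,
    ("GIN", "SE"): .230, ("GIE", "SE"): .571, ("FN", "SE"): .066,
    ("FE", "SE"): .461, ("SN", "SE"): .087,
    ("GIN", "MN"): .411, ("GIE", "MN"): .482, ("FN", "MN"): .351,
    ("FE", "MN"): .222, ("SN", "MN"): .322, ("SE", "MN"): .297,
    ("GIN", "ME"): .198, ("GIE", "ME"): .642, ("FN", "ME"): .083,
    ("FE", "ME"): .386, ("SN", "ME"): .109, ("SE", "ME"): .410,
    ("MN", "ME"): .545,
}

def classify(a, b):
    """Label a pair of variables by shared trait and shared discourse mode."""
    if a[:-1] == b[:-1]:
        return "same trait, different mode"
    if a[-1] == b[-1]:
        return "different trait, same mode"
    return "different trait, different mode"

groups = {}
for (a, b), r in corr.items():
    groups.setdefault(classify(a, b), []).append(r)

means = {kind: sum(values) / len(values) for kind, values in groups.items()}
for kind, value in means.items():
    print(kind, round(value, 3))
```

With these entries the averages are roughly .33 (same trait, different mode),
.48 (different trait, same mode) and .22 (different trait, different mode), so
scores sharing a discourse mode correlate more strongly on average than scores
sharing a trait.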



Insert Table 9 here

The chi-square value of 19.94 yields a non-significant probability of .46.
This result indicates that the observed test scores fit this model
adequately from a global point of view.
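
The reported probability can be checked from the chi-square value once the
degrees of freedom are known. The sketch below assumes 20 degrees of freedom,
obtained by counting 55 distinct variances and covariances among the ten
scores against 35 free parameters (5 equality-constrained trait loadings, 10
method loadings, 10 trait intercorrelations and 10 error variances). This
count is an inference from the model description, not a figure quoted in the
text. For even degrees of freedom the upper-tail probability has a closed
form, so no statistics library is needed; the function name is illustrative.

```python
import math

def chi2_upper_tail_even_df(x, df):
    """Upper-tail probability of a chi-square variate; valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * terms

p_value = chi2_upper_tail_even_df(19.94, 20)
print(round(p_value, 2))
```

Under the assumed 20 degrees of freedom the result agrees with the reported
probability of .46.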

Examination of all the LISREL estimates suggests that a better model
might be formed to fit the data. Loadings of test scores on their
corresponding traits range from .194 for the Organization subscale to .726
for the Mechanics subscale. The test of significance for the loading of
each observed score on its trait shows that traits have a significant effect
on their corresponding test scores with the exception of the trait of
Organization. The correlations among the traits, or psi coefficients,
indicate that three coefficients are abnormally high.¹ They are the
correlations between General Impression and Organization, between General
Impression and Support, and between Focus and Organization, as indicated by
the values of 1.378, 1.178, and 2.106, respectively. These high
correlations indicate that the corresponding traits are highly intercorrelated
and are not distinctive from each other. These high correlations might
absorb a large portion of measurement error for Model I and result in a
small chi-square value and non-significant probability. Therefore, a



Theoretically, correlations of such magnitude are impossible. These values 
may be an artifact of "overfitting" too many variables to the models and 
suggest the need for a new model with fewer factors. 
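The footnote's point can be checked mechanically: a correlation cannot exceed 1 in absolute value, so any psi estimate outside that range marks an inadmissible solution. The sketch below is illustrative only. The function name and the dictionary layout are assumptions introduced here; the three values are the ones the text reports for Model I.

```python
# Psi estimates for Model I that the text flags as abnormally high.
# Keys are trait pairs; values are the reported estimates.
flagged_psi = {
    ("General Impression", "Organization"): 1.378,
    ("General Impression", "Support"): 1.178,
    ("Focus", "Organization"): 2.106,
}


def inadmissible(estimates, bound=1.0):
    """Return the pairs whose estimated correlation lies outside [-bound, bound]."""
    return {pair: r for pair, r in estimates.items() if abs(r) > bound}


if __name__ == "__main__":
    for (a, b), r in inadmissible(flagged_psi).items():
        print(f"{a} / {b}: psi = {r:.3f} exceeds 1 in absolute value")
```

A screen of this kind only detects the symptom. It does not say which factors to merge or drop; that decision rests on the subscale definitions, as the text goes on to argue.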




TABLE 9

LISREL Estimates for Discourse Mode and Traits Effects

Model I

LAMBDA
       GI           FO           OR           SU           ME           Narrative    Expository
GIN    .629 (.109)  0            0            0            0            .592 (.130)  0
GIE    .629 (.109)  0            0            0            0            0            .518 (1.33)
FN     0            .580 (.115)  0            0            0            .493 (.133)  0
FE     0            .580 (.115)  0            0            0            0            .508 (.146)
ON     0            0            .194 (.321)  0            0            .822 (.130)  0
OE     0            0            .194 (.321)  0            0            0            .685 (.167)
SN     0            0            0            .374 (.170)  0            .656 (.141)  0
SE     0            0            0            .374 (.170)  0            0            .557 (.157)
MN     0            0            0            0            .726 (.099)  .243 (.124)  0
ME     0            0            0            0            .726 (.099)  0            .433 (.130)

PSI
       GI             FO             OR             SU           ME
GI     1.0
FO     .834 (.140)    1.0
OR     1.378 (1.814)  2.106 (2.986)  1.0
SU     1.178 (.341)   .552 (.321)    .980 (1.140)   1.0
ME     .773 (.125)    .429 (.190)    .663 (.953)    .693 (.297)  1.0

Chi-square with 20 df = 19.94



refined model is required to fit the structure of the data. 

A refined model with reasonable psi coefficients requires some
modifications of the CSE subscales. Signaled by information from the psi
correlations, we referred to the definitions of the CSE subscales (traits).
First, the trait of General Impression is a global rating encompassing,
but not limited to, the other subscales. The psi coefficients of General
Impression with other traits, ranging from .773 to 1.378, suggest that GI
is highly correlated with each of the four specific traits, supporting
the global characteristic of GI. Second, the traits of Focus and
Organization have common elements in their definitions. Both traits
basically deal with the construct of coherence, the logical relationships
among ideas in the essay. Since the GI trait is not so distinctive from
the others and contributes little information beyond the four specific
traits, it seemed advisable to exclude the GI trait from the model. Model
I also supports the advisability of collapsing the traits of Focus and
Organization into a single trait of Coherence.

Figure 2 presents the refined model. Instead of five traits, the new
model consists of only three traits: Coherence, which combines Focus and
Organization, and the traits of Support and Mechanics. The LISREL
estimates for Model II, summarized in Table 10, show that the value of
chi-square, 16.028, again yields a non-significant probability of .248. With

Insert Figure 2 here

Insert Table 10 here



TABLE 10

LISREL Estimates for Discourse Mode and Refined Trait Effects

LAMBDA
       C            S            M            N            E
FN     .559 (.119)  0            0            .595 (.131)  .000
FE     .559 (.119)  0            0            .000         .580 (.146)
ON     .348 (.139)  0            0            .789 (.125)  .000
OE     .348         0            0            .000         .599
SN     0            .294 (.195)  0            .739 (.145)  .000
SE     0            .294 (.195)  0            .000         .667 (.162)
MN     0            0            .742 (.097)  .285 (.122)  .000
ME     0            0            .742 (.097)  .000         .320 (.133)

PSI
       C            S            M
C      1.000
S      .311 (.462)  1.000
M      .413 (.197)  .854 (.476)  1.000

Chi-square with 13 df = 16.028
p = 0.248



Table 11

Discourse Mode Correlation Matrix for Model III Pooling Essay and
Multiple Choice Test Scores

       FON    FOE    ORN    ORE    SUN    SUE
FON    1.0
FOE    .647   1.0
ORN    .661   .455   1.0
ORE    .567   .622   .590   1.0
SUN    .518   .488   .454   .416   1.0
SUE    .504   .650   .409   .568   .615   1.0



a well-fitted model, we can proceed to inspect each parameter estimate.
The loadings of observed data on corresponding traits are significant,
except for the Support subscales. The psi coefficients, which are less
than 1 and comparatively low with respect to the size of their
standard errors, imply that the traits of Coherence, Support, and
Mechanics are distinctive from each other. The high loadings of observed
data on the two method factors, discourse mode, indicate that the influence
of discourse mode on observed data is quite large. This result further
supports previous analyses suggesting that the discourse mode required by
an essay task might dominate a student's performance.
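The significance statements above can be illustrated by dividing each loading by its standard error. The sketch below uses the trait loadings and standard errors from Table 10. The 1.96 cutoff (a two-sided 5 percent normal criterion) and the helper names are assumptions made for illustration; the report does not state which criterion was applied.

```python
CRITICAL_Z = 1.96  # assumed two-sided 5% criterion; not stated in the report

# (estimate, standard error) for the Model II trait loadings in Table 10
trait_loadings = {
    "Focus on Coherence": (0.559, 0.119),
    "Organization on Coherence": (0.348, 0.139),
    "Support on Support": (0.294, 0.195),
    "Mechanics on Mechanics": (0.742, 0.097),
}


def z_ratio(estimate, std_error):
    """Ratio of a parameter estimate to its standard error."""
    return estimate / std_error


def significant(estimate, std_error, critical=CRITICAL_Z):
    """True when the estimate differs from zero under the assumed criterion."""
    return abs(z_ratio(estimate, std_error)) > critical


if __name__ == "__main__":
    for name, (est, se) in trait_loadings.items():
        flag = "significant" if significant(est, se) else "not significant"
        print(f"{name}: z = {z_ratio(est, se):.2f} ({flag})")
```

Under this assumed criterion only the Support loading falls short, which agrees with the exception noted in the text.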

To examine whether the discourse mode effect holds across response
mode as well, a second set of analyses investigated the factor structure
underlying subskill (trait) and discourse mode in pooled essay and multiple
choice test scores. Six scores were formed for these analyses, two each
(expository and narrative) for the three traits common to essay and multiple
choice test scores: Focus, Organization and Support. This analysis did not
combine Focus and Organization into a Coherence factor, since this procedure
would have produced a two-trait, two-method model not appropriate for the
LISREL calculations. The correlation matrix for these variables appears
in Table 11.

Insert Table 11 here

Figure 3 displays the path diagram for Model III and Table 12 presents
the LISREL estimates of the free and constrained parameters.

Insert Table 12 here

The non-significant probability of .6587 yielded by the chi-square of 1.6030
with 3 df suggests that data pooled across essay and multiple choice
scores are adequately explained by this model (Figure 3). In other
words, the variances of each of the six observed scores in this model are
accounted for by two sources of influence, discourse mode and CSE subscale
trait. High loadings of the pooled scores on their corresponding traits
(Focus, Organization, or Support) reveal that each score is well defined
by its underlying trait. Loadings on the two discourse modes (narrative or
expository) are low to moderate except for ORN. Apparently, ORN was more
affected by discourse mode. In comparison to discourse mode, CSE traits
have more dominant effects on pooled scores. The relations among trait
factors show that Focus, Organization and Support are moderately to highly
correlated (.697 to .831), supporting the hypothesis that the traits
are distinctive, but not independent from one another. The traits Focus
and Organization are again highly correlated (.831). In comparison to
the psi matrix for Model I, the addition of the multiple choice scores
seems to increase the interrelationships among the subscales, suggesting
that the multiple choice data blur the distinctiveness of trait
information yielded only by essay ratings. We examined this possibility
directly through analyses of response mode effects.
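The reported probabilities are upper-tail areas of the chi-square distribution at the stated degrees of freedom. As a minimal sketch, the function below evaluates that tail area for integer degrees of freedom using only the standard library; `chi2_sf` is a name chosen here, and the finite-sum formulas are the standard closed forms for the regularized incomplete gamma function, not anything taken from LISREL.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term = math.exp(-half)
        total = term
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return total
    # Odd df: erfc(sqrt(x/2)) + exp(-x/2) * sum_{k=1}^{(df-1)/2} (x/2)^(k-1/2) / Gamma(k+1/2)
    total = math.erfc(math.sqrt(half))
    term = math.exp(-half) * math.sqrt(half) / math.gamma(1.5)
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= half / (k + 0.5)
    return total


if __name__ == "__main__":
    # (chi-square, df, probability as reported in the text and tables)
    reported = [(1.6030, 3, 0.6587), (16.028, 13, 0.248)]
    for chi2, df, p in reported:
        print(f"chi2 = {chi2}, df = {df}: computed p = {chi2_sf(chi2, df):.4f}, reported p = {p}")
```

A non-significant value here means the model cannot be rejected; it does not show that the model is the only one consistent with the data.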

* 



Table 12

LISREL Estimates for Discourse Mode and Refined Trait
Effects Pooling Essay and Multiple Choice Scores

Model III

LAMBDA
       FO           OR           SU           N            E
FON    .811 (.074)  0            0            .233 (.381)  0
FOE    .811 (.074)  0            0            0            .348 (.135)
ORN    0            .764 (.076)  0            .660 (.980)  0
ORE    0            .764 (.076)  0            0            .300 (.128)
SUN    0            0            .786 (.075)  .080 (1.75)  0
SUE    0            0            .786 (.075)  0            .460 (.156)

PSI
       FO           OR           SU
FO     1.0
OR     .831 (.062)  1.0
SU     .780 (.670)  .697 (.086)  1.0

Chi-square with 3 df = 1.6030
p = 0.6587

The second measurement issue addressed by the study was whether tasks
requiring different response modes (direct, production modes: essays,
paragraphs; indirect, selection modes: multiple choice) provide the same
type and quality of data about student writing abilities. Analyses first
examined correlations between scores students received on their essays,
paragraphs and multiple choice questions when the scores were from
measures all in the same discourse mode and when the scores were on tests
in different discourse modes. Table 13 presents these comparisons.

Insert Table 13 here



In general the relationships between scores a student received on the
two direct measures (essays and paragraphs) are stronger than the
relationships between his/her scores on the direct and indirect measures.
Furthermore, the substantially lower correlations between response mode
scores when the tasks also differ in discourse mode further corroborate the
discourse mode hypotheses. An MTMM analysis was performed next to search
more intensively for the factor structure underlying response mode effects.
The method variables for response mode were constructed according to
procedures analogous to those used to construct the discourse mode
variables. Each of the five subscale scores provided by the essay and
paragraph ratings was averaged over raters, then standardized within topic
and discourse mode. Standardizations were accomplished by removing possible
interactions between response mode on one hand and genre and/or topic on



Table 13

Correlations Between Test Scores from Different Response Modes

All Tests Were in the Same Discourse Mode

         EE           EP           EMc          PMc
GI       .581 (78)    .147 (153)   x            x
FO       .559 (78)    .275 (153)   .314 (316)   .371 (192)
OR       .473 (78)    .208 (153)   .15? (316)   .300 (192)
SU       .286 (78)    .353 (153)   .136 (316)   .323 (192)
ME       .629 (78)    .582 (153)   x            x
Total    .624 (78)    .428 (153)   .326 (316)   .471 (192)

Tests Were in Different Response Modes

         EE           EP           EMc          PMc
GI       .126 (94)    .415 (90)    x            x
FO       .391 (94)    .205 (90)    .211 (316)   .288 (192)
OR       .008 (94)    .301 (90)    .180 (316)   .233 (192)
SU       .229 (94)    .345 (90)    .180 (316)   .203 (192)
ME       .583 (94)    .632 (90)    x            x
Total    .355 (94)    .484 (90)    .325 (316)   .376 (192)

GI - General Impression        EE  - Essay: Essay
FO - Focus                     EP  - Essay: Paragraph
OR - Organization              EMc - Essay: Multiple Choice
SU - Support                   PMc - Paragraph: Multiple Choice
ME - Mechanics



the other. Complete data were available for 148 of the students.

Number correct scores were formed within genre for each of the three
multiple choice subscales, standardized within genre, summed across genre
and then restandardized to produce scores scaled in a manner comparable
to those derived from the writing samples. No measures of General
Impression and Mechanics were included in the multiple choice task.
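The score construction described above (average over raters, then standardize within each topic and discourse mode cell) can be sketched as follows. The data, field names, and the helper `standardize_within` are hypothetical, and the use of the population standard deviation is an assumption; the report does not say which divisor was used.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_within(records, group_key, score_key):
    """Return z-scores of records[score_key], computed separately within each group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec[score_key])
    stats = {g: (mean(v), pstdev(v)) for g, v in groups.items()}
    z_scores = []
    for rec in records:
        m, s = stats[rec[group_key]]
        # A group with no spread carries no information; report zero for it.
        z_scores.append((rec[score_key] - m) / s if s > 0 else 0.0)
    return z_scores


# Hypothetical ratings: two raters per essay, two topic-by-mode cells.
essays = [
    {"cell": ("topic A", "narrative"), "ratings": (2, 3)},
    {"cell": ("topic A", "narrative"), "ratings": (4, 4)},
    {"cell": ("topic B", "expository"), "ratings": (1, 2)},
    {"cell": ("topic B", "expository"), "ratings": (3, 4)},
]
for essay in essays:
    essay["score"] = mean(essay["ratings"])  # average over raters first

z = standardize_within(essays, "cell", "score")

if __name__ == "__main__":
    for essay, value in zip(essays, z):
        print(essay["cell"], essay["score"], round(value, 3))
```

Standardizing within cells removes mean and spread differences between topics and genres, so that what remains in the scores can be attributed to response mode and trait.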

In sum, 18 scores were constructed for analysis, three measures each
of General Impression and Mechanics (two essay and one paragraph), and
four each of Focus, Organization and Support (two essay, one paragraph
and one multiple choice). The variables and the abbreviations used to
refer to them in the response mode analyses appear in Table 14.



Insert Table 14 here 



The transformed scores permitted analysis first of the relationships
among the response mode variables after removing genre effects. Table 15
presents the correlation matrix for the three response modes and five
subscales.

Insert Table 15 here

The MTMM analyses began by considering the data for the "Essay 1"
and "Essay 2" methods only, examining the eight scores defined for these
two conditions, two measures each of Focus (fe1, fe2), Organization
(oe1, oe2), Support (se1, se2), and Mechanics (me1, me2). While the models
consider Coherence as the combination of Focus and Organization, their
separate observed scores are entered into the analysis. The model
specified for



Table 14

Description of Variables in LISREL
Analyses of Response Mode Effects

WRITING VARIABLES     "Essay 1"   "Essay 2"   Paragraph   Multiple Choice
General Impression    gie1        gie2        gip
Focus                 fe1         fe2         fp          fmc
Organization          oe1         oe2         op          omc
Support               se1         se2         sp          smc
Mechanics             me1         me2         mp

GENRE STUDY
General Impression    GIE1        GIE2        GIP
Focus                 FOE1        FOE2        FOP         FOMC
Organization          ORE1        ORE2        ORP         ORMC
Support               SUE1        SUE2        SUP         SUMC
Mechanics             MEE1        MEE2        MEP



Table 15 

Correlation Matrix of Scores Containing Response Mode and Subscale Trait Effects. 



GIE 



1 



GIE„ GIP FOEj FOE 0 FOP FOMC ORE, ORE, ORP ' ORMC SUE, SUE 



1 



-1 



1.000 
0.282 
0.333 
0.407 
0.215 
0.132 
.0.214 
0.816 
0.281 

i 

0.235 
0.131 
0.670 
0.284 
0-.268 



i;boo 

0.-232 

o*.3ia 
o.c n 

O.L*) 
'0.325 



1*000 

0.23G 1.000 
0.271 0:424 
0,511 0.231 
01299 0.256 
0.197. 0.224 0.441 
0.775- .0.250 o'.345 
0.209 .0.810 0.195 
0.285 . 0.348 0.229 
0.274 0.253 0.434 
0.525 
07360 



1.000 ' - ' 

.0:27.0 -1.000- ' 

0.379 0.362 1.000 

0.211 0.057 0.226* 

0.608 0.192 0.353 

'0.279 0.559 0.286 

0.360 0.304 0.474 

0.247 0.252 0.149 

0.179 0.233 '0.475 0.186 0.215 

0.620 0.324 0,375 0,520.0.344 



1.000 
0.252 
0,176 
0.142' 
0.566 
0.246 
0.269 



1.000' 

0.228 

0.307 

0.23D 

0:'550' 

(f:322 



1.000 ' 
0.339 '1.. QOO 
0.223*0.181 
0.170 0.256 
0.579 0.328 



1.000 
0.314 
6.353 



0.314 0.165 Q.302'0.239 0.264 0.259 0.4JL9 0.312 0.253 ^0.291 0,437 0.197 

0.492 0.286 01305 0.423 0.234 0.305 0.222 0.436 0.264 0.232 0.285 0.396 

0.357 0^452" 0.392 0.365 0.-512 0.353 0.37 3 0.27 6 6.4 1 9 . 0.297 0.344 0.300 

0.369 0.298 0,478 0.345 0.352 0.433 0.311 0.311 0.399 0.429 0.352 . 0.314 



1.000 
0.294 
0.172 
0.266 
0.361 
p. 272 



SUP 



SUMC 
- — f 



MEE, MEE„. MEP. 



1.000 

0.245 1.000 

0.-297 0.3P 

0.428 0.354 

0.330 0.360 



i:ooo 

0.589 
0.603 



l.OQO, 
0.555 1.000 






these variables includes the three relatively distinct trait/subscale
content factors emerging from the discourse mode analyses and two "method"
factors, one for each essay. Figure 4 displays a path diagram for Model I,
and Table 16 presents the LISREL estimates of the free and constrained
parameters for the model.

Insert Figure 4 and Table 16 here



Similar to the model specified for the discourse mode MTMM analysis,
the figure shows Model I allowing the trait or subscale factors to be
freely intercorrelated, while the method factors are specified to be
uncorrelated with each other and with the subscale factors. The
restriction on the method factor correlations corresponds to the hypothesis
that they act as independent additive components in the explanation of the
observed scores. In addition, we constrained the factor loadings for each
pair of subscale measures on their corresponding trait factors to equal
one another. These constraints are equivalent to a test of the hypothesis
that subscale scores from different essays will exhibit the same degree
of relationship to the trait factor they measure.

The model as a whole cannot be rejected; the chi-square goodness of
fit test yields a probability of .202 (ns), suggesting that the model
provides an adequate account for the observed correlations among essay
variables. Loadings of the essay variables on their respective trait
factors are all substantial and highly significant, ranging from a low of
.472 for Organization to a high of .768 for Mechanics. The Organization
subscale



Table 16

LISREL Estimates for Response Mode and Trait Effects

Model I

LAMBDA
       C            S            M            E1           E2
Fe1    .633 (.069)  0            0            .255 (.100)  0
Fe2    .633 (.069)  0            0            0            .485 (.095)
Oe1    .472 (.075)  0            0            .671 (.130)  0
Oe2    .472 (.075)  0            0            0            .639 (.105)
Se1    0            .535 (.077)  0            .534 (.118)  0
Se2    0            .535 (.077)  0            0            .507 (.104)
Me1    0            0            .768 (.062)  .289 (.092)  0
Me2    0            0            .768 (.062)  0            .254 (.089)

PSI
       C            S            M
C      1
S      .799 (.102)  1
M      .706 (.083)  .626 (.115)  1

Chi-square with 13 df = 16.9398
p = .2021











loadings on the method factors are relatively high and the loadings of
the Support subscale are moderate. Both Focus and Mechanics loadings
are low. Turning to the psi matrix, we see that the estimates of the
relations among the trait factors are moderate (below .80), ranging from
a low of .626 for the correlation between Mechanics and Support to a
high of .799 between Coherence and Support. The Mechanics factor
appears to be the most independent in the set.
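To make the structure of Model I concrete: with method factors uncorrelated with each other and with the traits, the correlation the model implies between two different observed scores is the product of their trait loadings times the trait correlation, plus the product of their method loadings when both scores come from the same essay. The sketch below applies that rule to the Table 16 estimates. The dictionary layout and function names are assumptions made for illustration.

```python
# variable -> (trait, trait loading, method, method loading), from Table 16
loadings = {
    "Fe1": ("C", 0.633, "E1", 0.255),
    "Fe2": ("C", 0.633, "E2", 0.485),
    "Oe1": ("C", 0.472, "E1", 0.671),
    "Oe2": ("C", 0.472, "E2", 0.639),
    "Se1": ("S", 0.535, "E1", 0.534),
    "Se2": ("S", 0.535, "E2", 0.507),
    "Me1": ("M", 0.768, "E1", 0.289),
    "Me2": ("M", 0.768, "E2", 0.254),
}

# Trait intercorrelations (psi) from Table 16
psi = {
    frozenset(("C", "S")): 0.799,
    frozenset(("C", "M")): 0.706,
    frozenset(("S", "M")): 0.626,
}


def trait_corr(a, b):
    """Correlation between two trait factors (1.0 for a factor with itself)."""
    return 1.0 if a == b else psi[frozenset((a, b))]


def implied_corr(x, y):
    """Model-implied correlation between two different observed variables."""
    trait_x, lam_x, method_x, gam_x = loadings[x]
    trait_y, lam_y, method_y, gam_y = loadings[y]
    r = lam_x * lam_y * trait_corr(trait_x, trait_y)
    if method_x == method_y:  # method factors are mutually uncorrelated
        r += gam_x * gam_y
    return r


if __name__ == "__main__":
    for pair in [("Fe1", "Fe2"), ("Fe1", "Oe1"), ("Oe1", "Me2")]:
        print(pair, round(implied_corr(*pair), 3))
```

The comparison of Fe1 with Oe1 shows why shared-essay scores correlate more highly than the traits alone would predict: the method term adds to the trait term.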

Model II adds the data from the paragraph task and expands to 12
the number of variables included in the analysis. Four subscales forming
the three trait factors specified in Model I each have, under Model II,
an additional subscale score loading (no constraints are placed on these
loadings); and a new, second method factor appears, Paragraph, to account
for covariation specific to this mode of responding. Figure 5 displays
the path diagram for this model and Table 17 presents the results of the
LISREL estimation of the parameters of Model II.

Insert Figure 5 here

Insert Table 17 here

Model II provides an adequate overall fit to the observed
intercorrelations among the essay and paragraph variables (chi square with
43 df = 51.533, p = .175). This result provisionally supports the hypothesis
that the scores generated by application of the rating system to paragraph
length writing samples can be interpreted as measuring the same underlying

Figure 5

Response Mode
Path Diagram for Model II





Table 17

LISREL Estimates for Response Mode and Trait Effects

Model II

LAMBDA
       C            S            M            E1           E2           P
Fe1    .581 (.068)  0            0            .282 (.096)  0            0
Fe2    .581 (.068)  0            0            0            .495 (.094)  0
Fp     .485 (.094)  0            0            0            0            .473 (.100)
Oe1    .468 (.069)  0            0            .683 (.126)  0            0
Oe2    .468 (.069)  0            0            0            .650 (.104)  0
Op     .417 (.095)  0            0            0            0            .780 (.114)
Se1    0            .522 (.070)  0            .502 (.111)  0            0
Se2    0            .522 (.070)  0            0            .461 (.099)  0
Sp     0            .639 (.095)  0            0            0            .421 (.0?5)
Me1    0            0            .773 (.062)  .250 (.081)  0            0
Me2    0            0            .773 (.062)  0            .192 (.080)  0
Mp     0            0            .728 (.078)  0            0            .207 (.078)

PSI
       C            S            M
C      1
S      .937 (.061)  1
M      .809 (.066)  .684 (.082)  1

Chi-square with 43 df = 51.5331
p = .175







content as the scores derived from full length essays. Inspection of the
lambda matrix shows that the loadings for paragraph subscale scores on their
associated trait factors are of substantial magnitude in each case, and
that the loadings on the paragraph factor follow the same general pattern
as for the two essay method factors. With one exception, the paragraph
variables appear to relate to trait factors less strongly than do the essay
scores. The one exception is an interesting one: "sp" provides a clearer
definition of the Support factor than either of the support measures derived
from essays. This would seem to suggest that the rater's task of judging
the use of support is carried out more distinctly in the context of single
paragraphs than it is in longer writing samples. A test of this hypothesis,
however, would require multiple paragraph measures of the "sp" variable.

As in Model I, the trait interconnections in the Model II psi matrix
are moderate to high, indicating considerable interdependence among the
subscales. Again, Mechanics exhibits lower levels of relationship to the
other subscales.

Comparison of Models I and II reveals two main differences. First,
there is some instability in the size of the essay variables' loadings on
the associated trait factors as we move from the first to the second model.
This leads to the interpretation that the factors composed of both essay
and paragraph variables do not measure precisely the same content as factors
composed of essay variables only. (In order to retain the same factor traits
across both models, we can constrain the lambda coefficients and psi
coefficients in Model I to be identical to corresponding coefficients in
Model II. This procedure is applied in later models.) Second, estimates of
the trait intercorrelations in Model II are greater than their counterparts
in Model I. Thus, although the inclusion of paragraph scores may have
broadened the content of the trait factors, it seems also to have diminished
their distinctiveness.

The third MTMM analysis builds on the previous two by adding the three
scores derived from the multiple choice items administered to the students
in the study. Recall that only items analogous to the Focus, Organization
and Support subscales were included in the multiple choice test. Model III
differs from Model II, then, by the specification of trait loadings for
these three subscales, and the addition of a multiple choice method factor.
Figure 6 displays the path diagram for Model III and Table 18, the LISREL
estimates of the model parameters.

Insert Figure 6 and Table 18 here



As in the first two analyses, Model III provides a reasonably good fit
to the data (chi square with 76 df = 84.962, p = .226), implying that the
same 3-trait structure is not violated by the inclusion of the multiple
choice scores. By and large, however, the Model III estimates of the essay
trait factor loadings have dropped in value in comparison with the
corresponding estimates from Models I and II. Also, the trait factor
intercorrelations in Model III have increased for Coherence and Mechanics,
indicating that the subscale content factors have drifted closer together as
a result of adding the multiple choice variables. Thus, while the multiple
choice scores apparently share some content with the constructed response
variables to which they are purportedly analogous, they also seem to possess
a higher degree of "latent collinearity" (Yates, 1979) in the trait factor



Table 18

LISREL Estimates for Response Mode and Trait Effects

Model III

LAMBDA
       C            S            M            E1           E2           P            Mc
Fe1    .585 (.065)  0            0            .323 (.092)  0            0            0
Fe2    .585 (.065)  0            0            0            .472 (.090)  0            0
Fp     .514 (.089)  0            0            0            0            .444 (.100)  0
Fmc    .546 (.088)  0            0            0            0            0            .399 (.125)
Oe1    .483 (.066)  0            0            .672 (.111)  0            0            0
Oe2    .483 (.066)  0            0            0            .619 (.100)  0            0
Op     .470 (.090)  0            0            0            0            .737 (.118)  0
Omc    .528 (.089)  0            0            0            0            0            .462 (.136)
Se1    0            .469 (.067)  0            .546 (.103)  0            0            0
Se2    0            .469 (.067)  0            0            .480 (.098)  0            0
Sp     0            .611 (.089)  0            0            0            .398 (.094)  0
Smc    0            .483 (.091)  0            0            0            0            (illegible)
Me1    0            0            .772 (.061)  .277 (.078)  0            0            0
Me2    0            0            .772 (.061)  0            .183 (.078)  0            0
Mp     0            0            .747 (.077)  0            0            .182 (.078)  0

PSI
       C            S            M
C      1
S      .978 (.049)  1
M      .786 (.059)  .786 (.073)  1

Chi-square with 76 df = 84.9619
p = 0.2255



space. Whether this situation arises because the multiple choice variables
are related to writing ability in some non-specific fashion, or because
all of the variables, but especially the multiple choice scores, share a
common dependence on general verbal ability, cannot be disentangled
without additional analyses, including tests marking general ability factors.
In any event, it is reasonable to interpret the increased interdependence
among trait factors as an indication that the multiple choice scores possess
generally lower validity as indices of distinctive components of writing
ability than do measures based on actual writing samples.

Model IV examines the relationship of the paragraph and multiple choice
variables to the set of trait factors defined solely on the basis of the
essay variables. Generally speaking, both the contribution of the essay
scores to the definition of the subscale content factors and the degree of
independence of these factors from one another were reduced as data from
the alternative response modes were added to the analysis. Model IV takes
the trait factor structure that obtained when only essay scores were
included in the analysis (i.e., Model I) as a criterion definition of the
content underlying the subscales, treating the Model I trait factors as
"unmeasured" criterion variables against which to compare the scores from
the other two response modes. This can be accomplished in LISREL by
modifying the specification for Model III in two ways. First, instead of
estimating trait loadings for essay variables, new specifications fix their
values to equal those estimated in Model I. Second, we place a similar
constraint on the trait factor intercorrelations in Psi, by fixing their
values at those obtained in the essay-only solution for Model I. These
two sets of restrictions will ensure that the trait factors found in Model I
will be reproduced exactly in Model IV, and the standing of the paragraph
and multiple choice variables can be evaluated vis-a-vis the essay criterion
trait structure. The LISREL estimates of the free parameters in Model IV
are contained in Table 19.
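The effect of these restrictions on model size can be tallied by counting free parameters. The sketch below is a reconstruction, not the authors' LISREL setup: it assumes a correlation-structure model with factor variances fixed at 1, one free unique variance per observed score, equality-constrained essay trait loadings counted once per subscale, and one free method loading per score. Under those assumptions the counts reproduce the degrees of freedom reported for the four response mode models (13, 43, 76 and 83).

```python
def moments(n_vars):
    """Number of distinct variances and covariances among n_vars observed scores."""
    return n_vars * (n_vars + 1) // 2


def degrees_of_freedom(n_vars, free_trait_loadings, free_method_loadings, free_psi):
    """Distinct moments minus free parameters (one unique variance per score assumed)."""
    free = free_trait_loadings + free_method_loadings + free_psi + n_vars
    return moments(n_vars) - free


# Parameter counts are assumptions inferred from the model descriptions in the text.
models = {
    # 8 essay scores; 4 equality-constrained trait loadings; 3 trait correlations
    "Model I": dict(n_vars=8, free_trait_loadings=4, free_method_loadings=8, free_psi=3),
    # adds 4 paragraph scores, each with a free trait loading
    "Model II": dict(n_vars=12, free_trait_loadings=8, free_method_loadings=12, free_psi=3),
    # adds 3 multiple choice scores, each with a free trait loading
    "Model III": dict(n_vars=15, free_trait_loadings=11, free_method_loadings=15, free_psi=3),
    # essay trait loadings and psi fixed at their Model I values
    "Model IV": dict(n_vars=15, free_trait_loadings=7, free_method_loadings=15, free_psi=0),
}

if __name__ == "__main__":
    for name, spec in models.items():
        print(name, degrees_of_freedom(**spec))
```

Fixing the four essay trait loadings and the three psi coefficients is what accounts for the seven additional degrees of freedom in Model IV relative to Model III.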

Insert Table 19 here

The only parameter estimates of direct interest in Table 19 are the
trait factor loadings for the paragraph and multiple choice variables.
The data indicate near uniform reduction in their magnitude, in comparison
to the estimates obtained from Model III. This shift does not reduce the
overall model fit (chi square with 83 df = 95.547, p = .164). In all but
two instances, paragraph and multiple choice trait factor loadings are lower
than the corresponding loadings for the essay variables. Both exceptions
are recurrences of the findings from Model II and Model III that the
measure of Support derived from a paragraph length writing sample outperforms
the Support measures based on full length essays and the Organization score
in multiple choice is slightly more distinct than the Organization measures
on essays, respectively. Support as measured by multiple choice items
seems to reflect relatively little of what is measured in actual writing
samples, while multiple choice measures of Focus and Organization seem
to convey a roughly comparable amount of information about subscale content
to that contained in a single paragraph.



» * 



Table 19

LISREL Estimates for Response Mode and Trait Effects

Model IV

LAMBDA
       C            S            M            E1           E2           P            Mc
Fe1    .633         0            0            .333 (.092)  0            0            0
Fe2    .633         0            0            0            .436 (.088)  0            0
Fp     .463 (.088)  0            0            0            0            .457 (.095)  0
Fmc    .526 (.087)  0            0            0            0            0            .407 (.115)
Oe1    .472         0            0            .676 (.109)  0            0            0
Oe2    .472         0            0            0            .606 (.100)  0            0
Op     .408 (.088)  0            0            0            0            .767 (.113)  0
Omc    (illegible)  0            0            0            0            0            .452 (.123)
Se1    0            .535         0            .538 (.099)  0            0            0
Se2    0            .535         0            0            .496 (.100)  0            0
Sp     0            .604 (.089)  0            0            0            .414 (.089)  0
Smc    0            .410 (.094)  0            0            0            0            .475 (.131)
Me1    0            0            .768         .290 (.078)  0            0            0
Me2    0            0            .768         0            .178 (.079)  0            0
Mp     0            0            .719 (.072)  0            0            .197 (.075)  0

Chi-square with 83 df = 95.5468
p = 0.1636



Summary and Conclusions

The purpose of the study was to examine the comparability of writing
competency profiles derived from test tasks differing in discourse and
response mode. Theory and research in the fields of learning, instruction
and rhetoric have fueled contentions that the knowledge structures and
processing strategies activated by different writing aims and modes of
responding are quite distinct. We were attempting to demonstrate the
robustness of these claims from a measurement perspective.

In practice, many current writing assessment programs fail to consider
the validity of test data that do not distinguish between the demands
of types of writing tasks and between the requirements of production and
selection. At heart, the issue is one of construct validity: do these
alternative task and processing variables measure the same thing? Our
results indicate that the answer is "no."

In this study the results of correlational, parametric and
multitrait-multimethod analyses indicate that levels of performance vary on
tasks presenting different writing purposes. These data cast doubt on
the assumption that "a good writer is a good writer" regardless of the
assignment. The implication is that writing for different aims draws on
different skill constructs which must therefore be measured separately
to avoid erroneous, invalid interpretations of performance. The
findings suggest that generalizations about student writing competence must
reference the particular discourse domain rather than the general domain
of writing.

The study also investigated the distinctiveness of information
about writing competence provided by direct and indirect measurement
techniques. Again the issue is one of validity: do both response
modes measure the same skill construct? Again the answer is "no." When
essay performance is set as the criterion, multiple choice performance
seems to be a poor proxy. Tests of response mode effects within an MTMM
framework suggest that scores on the General Impression and Organization
subscales contain large method components when measures are taken from
constructed responses. A plausible explanation for this finding is that the
method factor loadings for these variables are inflated by within-occasion
(e.g., a given essay) residual linkages between GI and O brought about by
raters' tendency to depend more on Organization than on other specific
features in formulating their General Impression rating. The remaining
three subscales all were found to contain proportionately larger amounts
of content-related variance than method-related variance, with Mechanics
appearing to be the purest of the three. The patterning of method variance
saturation in the five subscales was the same for the three writing samples
available for each subject.

An interesting picture of the effects of the varying response mode
emerged from the analyses. While models can be fitted to the data from
all three response modes that confirm the subscales' content, the
degree of independence of the resulting subscale factors appears to be
affected by which response modes are included in the analysis. The most
differentiated subscale factor structure is obtained by including only
essay variables in the analysis; interdependence among the subscale factors
increases with the addition of both paragraph and multiple choice measures.
Thus, the effect of shortening the assessment task for the examinee through
examination of just paragraphs or of changing the form of the response
(multiple choice tasks) does not simply increase the measurement error.
The savings in testing time are obtained also at the cost of clarity and
distinctiveness in the information about each of the subscales. When
the subscale content factors are located in the variable space so as to
maximize their relationship to scores derived from the essay response
mode, all other subscale-response mode combinations, except one, provide
weaker substantive information. The one exception is the measure of
Support based on paragraph-length writing samples, which seems to be
superior to the corresponding essay variables in its ability to capture
subscale content. It may be that the use of support is less equivocally
evaluated in the context of a single paragraph than in an essay
containing multiple paragraphs, each of which may suggest a different view
of the examinee's ability to provide supporting detail.

The MTMM analyses also provided information about the validity of the
rating scales. The MTMM analyses suggested, first, that repeated
applications of the CSE scoring rubric to writing samples in fact produce measures
that tap the same underlying content. Thus, given multiple measures of
each subscale, it is possible to fit a factor analysis model that confirms
their hypothesized content. Second, it was found that factors reflecting
the content of all five subscales are strongly intercorrelated, and this
interdependence appears to be present no matter what response mode subjects
are assessed in. When the global judgment for General Impression was
removed and Focus and Organization combined into a Coherence subscale, scale
intercorrelations became more moderate and distinct. Since techniques
for producing writing that is coherent, supported and mechanically correct
are often taught separately, further examination of the value of rating
writing according to separate component features should consider both their
diagnostic utility and component distinctiveness.

Finally, the study exemplified the contribution MTMM analyses can
make to validity studies. The technique may provide more sensitive,
precise statistical indices of hypothesized competencies underlying test
performance.

In summary, the study highlights the importance of precision in de- 
signing, analyzing and reporting writing assessment data. It may be that 
the techniques developed for specifying domain-referenced skill boundaries 
can provide a reasonable framework for focusing attention, discussion, 
assessment and instruction on clearly bounded classes of writing perfor- 
mance. 






References 



Anderson, R. C. The notion of schemata and the educational enterprise.
In Anderson, R. C., Spiro, R. J., & Montague, W. E. (Eds.), Schooling
and the acquisition of knowledge. Hillsdale, NJ: Lawrence Erlbaum
Associates, 1977.

Bourne, L. J. Human conceptual behavior. Boston: Allyn and Bacon, 1966.

Braddock, R., Lloyd-Jones, R., & Schoer, L. Research in written
composition. Champaign, IL: National Council of Teachers of English, 1963.

Breland, H. M. A study of college English placement and the Test of
Standard Written English. Princeton, NJ: Educational Testing Service,
January, 1977.

Breland, H. M., Conlan, G. C., & Ragosa, D. A preliminary study of the
Test of Standard Written English. Princeton, NJ: Educational Testing
Service, 1976.

Breland, H. M., & Gaynor, J. L. A comparison of direct and indirect
assessments of writing skill. Journal of Educational Measurement, 1979,
16, 119-128.

Campbell, D. T., & Fiske, D. W. Convergent and discriminant validity by
the multitrait-multimethod matrix. Psychological Bulletin, 1959, 56,
81-105.

Coffman, W. E. Essay exams. In R. L. Thorndike (Ed.), Educational
measurement (2nd ed.). Washington, D.C.: American Council on Education,
1971.

Cooper, C. R. Current studies of writing achievement and writing competence.
Paper presented at the annual meeting of the American Educational Research
Association, San Francisco, 1979.

Cooper, C., & Odell, L. (Eds.) Evaluating writing: Describing, measuring,
judging. Buffalo, NY: State University of New York at Buffalo, 1977.

Crowhurst, M. Syntactic complexity in narration and argument at three
grade levels. Canadian Journal of Education, 1980.

Diederich, P. B. Measuring growth in English. Champaign, IL: National
Council of Teachers of English, 1974.







French, J. W. Schools of thought in judging excellence of English themes.
1961 Proceedings of Invitational Conference on Testing Problems,
Educational Testing Service, Princeton, NJ, 1962.

Godshalk, F. I., Swineford, F., & Coffman, W. E. The measurement of
writing ability. New York: College Entrance Examination Board, 1966.

Graesser, A. C., Hauft-Smith, K., Cohen, A. D., & Pyles, D. Familiarity
and test genre on retention of prose. Journal of Experimental
Education, 1979, 48, 281-290.

Hogan, T. P., & Mishler, C. Relationships between essay tests and objective
tests of language skills for elementary school students. Journal of
Educational Measurement, 1980, 17, 219-227.

Joreskog, K. G. Analyzing psychological data by structural analysis of
covariance matrices. In R. C. Atkinson, D. H. Krantz, & P. D. Suppes
(Eds.), Contemporary Development in Mathematical Psychology, Vol. II.
San Francisco: W. H. Freeman & Co., 1974, 1-56.

Kinneavy, J. L. A theory of discourse. The aims of discourse. Englewood
Cliffs, NJ: Prentice Hall, Inc., 1971.

Meyer, B. F. The organization of prose and its effects on memory. North
Holland Studies in Theoretical Poetics (Vol. I). Amsterdam: North
Holland Publishing Company, 1975.

Minsky, M. A framework for representing knowledge. In P. H. Winston (Ed.),
The psychology of computer vision. New York: McGraw-Hill, 1975.

Perron, J. D. Written syntactic complexity in modes of discourse. Paper
presented at the annual meeting of the American Educational Research
Association, New York, April 1977.

Pitts, M. The relationship of classroom instructional characteristics
and writing in the descriptive/narrative mode. Report to the National
Institute of Education, November, 1978. (Grant No. OB-NIE-G-78-0213
to the UCLA Center for the Study of Evaluation.)

Praeter, D., & Padia, W. Effects of modes of discourse in writing
performance in grades four and six. Paper presented at the annual meeting
of the American Educational Research Association, Boston, 1980.

Quellmalz, E. Interim Report. Defining writing domains: Effects of
discourse and response mode. Center for the Study of Evaluation, 1979.

San Jose, C. P. M. Grammatical structures in four modes of writing at
fourth grade level. Unpublished doctoral dissertation, Syracuse
University, 1972. Dissertation Abstracts International, 1973, 33, 5411-A.






Skinner, B. F. Teaching machines and programmed learning. New York:
Appleton Century Crofts, 1963.

Smith, L. S. Investigation of writing assessment strategies. Report to
the National Institute of Education, November, 1978. (Grant No.
OB-NIE-G-78-0213 to the UCLA Center for the Study of Evaluation.)

Traub, R. E., & Fisher, C. W. On the equivalence of constructed-response
and multiple-choice tests. Applied Psychological Measurement, 1977, 1(3),
355-369.

Veal, L. R., & Tillman, M. Mode of discourse variation in the evaluation
of children's writing. Research in the Teaching of English, 1971, 5,
37-45.

Werts, C. E., Joreskog, K. G., & Linn, R. L. A multitrait-multimethod
model for studying growth. Educational and Psychological Measurement,
1972, 32, 655-678.

Winters, L. The effects of differing response criteria on the assessment
of writing competence. Report to the National Institute of Education,
November, 1978. (Grant No. OB-NIE-G-78-0213 to the UCLA Center for
the Study of Evaluation.)

Yates, A. Distinguishing trait from method variance in fitting the
invariant common factor model to observed intercorrelations among personality
rating scales. Paper presented at the annual meeting of the Society
of Multivariate Experimental Psychology, Los Angeles, CA, 1979.






PROBLEMS IN STABILIZING THE JUDGMENT PROCESS



Edys Quellmalz 



Center for the Study of Evaluation 
Graduate School of Education
University of California, Los Angeles 



The project presented or reported herein was performed
pursuant to a grant from the National Institute of
Education, Department of Education. However, the opinions
expressed herein do not necessarily reflect the position
or policy of the National Institute of Education, and
no official endorsement by the National Institute of
Education should be inferred.






Problems in Stabilizing the Judgment Process

Edys Quellmalz 

Center for the Study of Evaluation 
University of California, Los Angeles 

The increasing demand for competency assessments of complex human
performance has led to renewed scrutiny of the conceptual and technical
quality of prevailing testing practice. Particularly in the area of
language production, i.e., writing, oral language and oral reading,
researchers and practitioners assert that competency tests must provide
tasks that match performance objectives and that activate cognitive
processing strategies required by production rather than recognition
tasks. The validity of indirect (i.e., multiple choice) measures is no
longer logically, psychologically or ecologically acceptable to the
majority of professionals in writing instruction and evaluation. Life
is not a multiple choice. Students' language production skills, in
particular, must be sufficiently proficient for students to function
autonomously in the real world.

Although collecting samples of complex performance can presumably
provide "direct," valid measures of content, the renowned unreliability
of judging constructed responses continues to plague assessment
methodology. Because direct performance samples are mediated by highly
variable judgments of raters who score or characterize performance samples
along some dimensions, a critical goal for performance judgment in
general, and for writing judgment in particular, is to find ways to assure
that judges apply scoring criteria accurately and fairly. As a part of
a broader program studying issues in test design, we have investigated
dimensions of the test tasks, context and scoring that will reduce
irrelevant variability in examinee and rater behavior.

This paper analyzes a series of measurement problems that jeopardize
the validity of the judgment process and examines the effectiveness of
methods currently employed to address these problems. Reviews of
prevailing rating practices, in conjunction with cumulative empirical evidence
on factors influencing judgments in domain-referenced assessment,
demonstrate that direct writing assessment faces a dual validity requirement.
Both the test task and the scoring procedure must meet separate conceptual
and statistical validity standards. The paper elaborates the requirements
for accurate and fair writing competence assessment and illustrates how
state-of-the-art rating processes pose serious threats to the validity of
the writing assessments.

Domain-Referenced Scoring Requirements

The avowed intent and structure of competency or domain-referenced
tests require explicit, replicable scoring criteria and procedures; thus,
the need for methods to stabilize rating criteria and readers' application
of them is immediate and real. Soon the uniform application of performance
criteria may become a legal requirement when decisions based on these tests
result in life-altering consequences for students. Mandates proliferate
at state and local levels for writing assessment at all levels of public
school, and large numbers of writing samples must be scored by great
numbers of raters. Many assessment programs are required to provide
students repeated opportunities to pass comparable forms of a test. Also
built into many assessment programs is a requirement to administer
comparable tests, at regular intervals, at geographically separate sites.

The purpose of these competency assessments is to monitor development
of students' skills at points specified throughout their schooling,
to detect skills for which they might need remedial assistance and to
document skill development. A student who fails to demonstrate competency
in writing, receives additional instruction, and is then retested should
be judged according to the same standards at each test administration.
His or her score should depend neither on the performance of a new
cohort of examinees nor on the idiosyncratic values of differently
oriented sets of raters.

Unfortunately, many writing assessment programs derive their guidelines
from norm-referenced test methodology. In practice, norm-referenced
writing tests are scored by ranking papers within the limits of a
particular sample. Essays are usually scored holistically, on generally
described criteria, and involve scoring procedures where raters rank essays
by sorting them into piles anchored by the range of quality of that
particular sample (Conlan, 1976). Thus a particular paper's rank and/or
score could change from sample to sample, if the range of the quality of
the competition varied from one test group to the next. Such practices
result in a "sliding scale" where the rated quality of a particular paper
changes according to the quality range of papers in the group. For example,
a student might take a writing competency test in the fall, when all
students, low achievers to college preparatory students, participate. A
student's rank in this wide quality range is below mastery. In the spring,
the student, along with the restricted range of students who failed the
first administration, takes another writing competency test and just passes.
Does s/he pass because intervening writing instruction has strengthened
weak writing skills, or because her or his rank is higher in the restricted
range of poorer writers? Present holistic scoring procedures cannot
provide an answer to this question. The holistic score provides no evidence
of the developmental level of specific writing weaknesses that were low and
may have improved. Despite the use of "anchor" papers during training to
illustrate what a "6" or "3" had been for other groups, the most prevalent
holistic scoring procedures still require raters to distribute papers
across the score range.
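
The sliding-scale effect in the fall/spring example can be sketched in a few lines. Everything in the sketch is hypothetical: the quality values, the three-pile sort, and the fixed cut score are invented for illustration and are not taken from Conlan (1976) or any operational scoring procedure.

```python
def rank_pile(score, cohort, n_piles=3):
    """Norm-referenced score: pile number (1 = lowest) from rank in the cohort."""
    below = sum(1 for s in cohort if s < score)
    # Map the fraction of the cohort ranked below this paper onto the piles.
    return min(n_piles, int(below / len(cohort) * n_piles) + 1)


def criterion_pass(score, cut=4.0):
    """Criterion-referenced decision: compare against a fixed standard."""
    return score >= cut


# Hypothetical paper quality on an arbitrary 1-6 scale.
fall_cohort = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5]  # all students
spring_cohort = [1.5, 2.0, 2.5, 3.0, 3.5]                    # fall failures only

paper = 3.5  # identical writing quality on both occasions

print(rank_pile(paper, fall_cohort))    # 2: middle pile in the full range
print(rank_pile(paper, spring_cohort))  # 3: top pile in the restricted range
print(criterion_pass(paper))            # False on both occasions
```

The paper's quality never changes, yet its pile rises when the competition narrows, while the fixed-standard decision stays the same.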

A major measurement problem confronting many competency based writing
assessments, then, is the failure to deal with the need to assure
comparability of scoring between test occasions as well as within a scoring
session. Such comparability would require not just statistical indices
of rater agreement but comparisons of mean scores, since ratings within
a session might agree but differ between sessions. Adopting a
norm-referenced method of criteria application, based on ranking within occasion,
imperils, if not precludes, between-occasion uniformity of criteria
application. Therefore two measurement problems inhere in judgment
stability: stability within a session and stability across sessions.

To document scale stability, an assessment would have to intersperse
anchor papers scored in previous assessments among papers rated within
an on-going rating session and report comparability of anchor paper scores
across test occasions and rater groups. Such documentation of
comparability is conspicuously absent in both research and practice.
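
One way such an anchor-paper report might be computed is sketched below. The function name, paper identifiers, and scores are hypothetical, and the simple mean shift is only one of several indices an assessment program could choose.

```python
from statistics import mean


def anchor_shift(previous, current):
    """Compare scores for the same anchor papers on two rating occasions.

    previous, current: dicts mapping anchor-paper id -> score.
    Returns (mean shift, per-paper shifts) for papers scored on both occasions.
    """
    shared = sorted(set(previous) & set(current))
    if not shared:
        raise ValueError("no anchor papers in common")
    shifts = {pid: current[pid] - previous[pid] for pid in shared}
    return mean(shifts.values()), shifts


# Hypothetical anchor papers scored on a 1-6 scale in two sessions.
fall = {"A1": 2, "A2": 3, "A3": 4, "A4": 5}
spring = {"A1": 3, "A2": 4, "A3": 4, "A4": 6}

avg, per_paper = anchor_shift(fall, spring)
print(avg)        # 0.75: the spring raters scored the same papers higher
print(per_paper)
```

A mean shift near zero would support between-occasion stability; a consistent positive or negative shift would indicate that the scale itself has moved, even if raters within each session agree with one another.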

Research on Rating Variability

Evidence pointing to the sources and manifestations of scale instability
can be found in the rapidly accumulating body of research on issues of
rating variability. The instability of ratings has been a major, and
generally acknowledged, weakness of measures of writing skill (Coffman, 1971b).
Braddock, Lloyd-Jones and Schoer (1963) classified four sources of error:
1) the writer, 2) the assignment, 3) the rater, 4) between raters. Although
considerable research within the framework of domain-referenced testing
has examined dimensions of the test task that influence writer performance
such as discourse aim and topic modality (Pitts, 1978; Spooner-Smith, 1978;
Quellmalz, 1979; Praeter & Padia, 1980; Crowhurst, 1980), less attention
has been given to the factors involved in rater behavior.

In the broadest sense, inter- and intra-rater variability are a matter
of fluctuating standards of judgment. Research has amply demonstrated
that anarchical scoring of essays, where raters apply their individual
standards, results in high disagreement among raters from different
occupations (Diederich, French, & Carlton, 1961) and even among English
professors (Findlayson, 1951; McColly, 1970). Follman and Anderson (1967)
demonstrated that the more homogeneous the background of raters, the more
their scoring agreed. Long ago, Eels (1930) demonstrated the problem
of intra-rater criteria bias when he found that the variability in essay
scores assigned even by the same reader on different occasions approached
the degree of variability of scores assigned by different readers.
Recognizing the magnitude of error occurring in unstructured scoring, researchers
attempted to devise various techniques for controlling score variability.

Methods for Controlling Scoring Variability

The first and most critical step in stabilizing the bases of readers'
judgments is to establish common, explicit scoring criteria. Criteria
may either be specified deductively by invoking standards derived from
the rhetorical tradition (e.g., Kinneavy, 1971) or inductively by
seeking commonality among readers' comments on papers (Diederich, 1974;
Freedman, 1978). Systematic training on common scoring criteria has
proved to reduce some kinds of interrater variability effectively
(Stalnaker, 1934; Diederich, 1974). As a result of these pioneering
studies, standard methodology now includes training of raters on the use
of rating scales until a high level of agreement among raters is achieved.
In a recent study of the discriminative validity of alternative scoring
rubrics, Winters (1978) suggested that high rater reliability coefficients
in pilot or in final rating sessions might not necessarily signal standard,
uniform interpretation of rating scales over rating occasions and across
rater groups. During rater training she observed that less operationalized
scale rubrics stimulated extensive discussion and interpretation and
suggested that different rater groups might achieve high reliability, but
have interpreted vague criteria differently by devising different specific
decision rules for the same ambiguous criteria. Thus, high reliability
coefficients might be obtained, but at the cost of accurate, replicable
scoring. As Winters implies, redefinition of criteria by the social rating
group can have serious implications for the fairness of ratings across
rater groups.

Rater Drift 

Even with training for rater consensus, when raters practice applying
explicit criteria, rating fluctuation may still occur. The deviation of
raters from previously shared criteria is termed "rater drift" and may be
signaled by lowered inter-rater reliability and differences between raters'
criteria interpretation and expert-generated criterion-based ratings.

Rater drift is particularly a problem when there are large sets of
papers to be scored. Shifting criteria or drift may be caused by rater
fatigue, or by more systematic influences, such as the quality range of
the sample of papers being read or idiosyncratically valued criteria.
In a description of the rater as a source of error, Braddock et al.
(1963) discussed the need for controlling for rater fatigue. They cited
fatigue as a cause for raters to become severe or erratic in their
evaluation or to place more weight on particularly noticeable essay elements
such as mechanics. Godshalk, Swineford, and Coffman (1966) found
significant differences between papers scored holistically early and later in a
set of 646 papers. Coffman (1971b) warned that even when two sets of scores
derive from changing combinations of raters, "there may still be
differences in the means and standard deviations attributable to order effects,
that is, the tendency of groups of raters to shift their standards as
the reading proceeds" (p. 276). Coffman (1971a) also discussed raters'
tendency to regress to their own internalized set of standards and
recommended practice on common criteria.

Rater drift impairs the technical quality of rating results by
reducing inter- and intra-rater reliability and, more importantly, compromises
the validity of ratings. However, writing assessment programs do not seem
to acknowledge rater drift as a validity problem, nor do they deal with
rater drift directly.

State-of-the-Art Procedures for Treating Scoring Variability 

Current rating procedures (Conlan, 1976; Office of the Los Angeles
Superintendent of Schools, 1977) generally follow methods recommended by
Braddock et al. (1963) and Coffman (1971a) and have evolved a number
of methods to deal with rater variability. Typically, raters begin by
practicing applying a rubric to a sample set of papers. The nature and
relative specificity of scale criteria and scoring formats (holistic vs.
analytic) vary, as do the weights of component criteria. Before independent
rating begins, trainers conduct a reliability check. Sometimes consensus
is checked statistically; sometimes it is indicated by a show of hands.

During independent ratings, methods for dealing with rater agreement
tend to take two tacks: correction and maintenance. Procedures which
emphasize correction use post hoc methods to treat score discrepancies.
Common options are: 1) having a third reader score any paper where the
first readers disagree by more than one point; 2) using the sum of two
ratings as a total score; 3) randomizing the order in which two raters
score an essay in order to distribute rater error, although often the
randomization occurs in a single day. These post hoc correction
procedures sidestep the validity problem of the changing criteria employed
by the drifting rater.
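
The first two correction options can be written out as a short sketch. The function name and the sample ratings are hypothetical, and because the text does not say how a third reader's score is combined with the first two, that step is left to the caller.

```python
def correct_scores(first, second, max_gap=1):
    """Apply two of the post hoc correction options to a pair of ratings.

    Returns (total, needs_third_reader). The total is the sum of the two
    ratings; a third reading is flagged when the ratings differ by more
    than max_gap points, in which case no total is reported yet.
    """
    needs_third = abs(first - second) > max_gap
    total = None if needs_third else first + second
    return total, needs_third


# Hypothetical pairs of ratings on a 1-6 scale.
ratings = {"p01": (4, 4), "p02": (3, 4), "p03": (2, 5)}
for paper_id, (r1, r2) in ratings.items():
    print(paper_id, correct_scores(r1, r2))
```

The sketch only reconciles numbers. Nothing in it checks whether either rater is still applying the original criteria, which is the limitation the paragraph above describes.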

A second set of procedures for dealing with rating variability aims
at maintenance of scoring accuracy. Periodic consensus checks on
identical papers are interspersed at varying intervals. Checks may be common
to all raters, discussed in the group, discussed within rater pairs or
discussed with a "master" rater. In the procedure, discrepancies are
called to the raters' attention and their bases revised. These
maintenance procedures at least attempt to prevent, detect, and control
scoring error by providing feedback to individual raters regarding the accuracy
and consistency of their scoring decision rules.

Rating Variability in Competency Assessment Research 

In a series of studies examining dimensions important in the
formulation of valid, instructionally sensitive writing assessments, we
documented the effects of several stringent procedures for attaining and
maintaining rater congruence and fidelity to the rating scale. One
component of the methodology was to develop analytic scoring rubrics referenced
to basic structural features of a discourse mode. Explicit criteria were
designed to reference operational, instructionally manipulatable elements
of the paper. Raters practiced applying the scoring rubric in intensive
training sessions, and reliability checks using generalizability statistics
were calculated to assure inter-rater reliability. During final,
independent ratings, common checks occurred at frequent intervals. Discrepancy
resolution procedures were of several types, including group discussion
or pair discussion. The research focus of these studies was on variations
of the tasks of writing rather than on variables influencing the rating
process, yet the accumulating data indicated that stabilizing the
judgment process was a complex issue, one deserving direct experimental
investigation. This conclusion derived primarily from three of our studies
in which we observed rater drift surface as a problem, despite the
different procedures used to prevent it. We also began to inspect indices of
scale stability by looking at scores given by raters trained at different
times to the same set of papers.

Rater Drift 

In our writing assessment research our initial scoring concerns were
to establish and maintain rater agreement. To determine that this occurred,
we compared reliabilities obtained immediately after training (on a pilot
test of independent ratings) and after the final ratings. Table 1 presents
a comparison of generalizability coefficients marking rater agreement
levels on pilot and final ratings.
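
For readers unfamiliar with the statistic, the sketch below shows one common way to obtain a generalizability coefficient for a fully crossed papers-by-raters design with one rating per cell. This is an assumption for illustration: the studies do not state the design or estimation formulas they used, and the function name and data are hypothetical.

```python
def g_coefficient(ratings, n_ratings=None):
    """Relative G coefficient for a papers x raters table of scores.

    ratings: list of rows, one row per paper, one column per rater.
    n_ratings: number of ratings averaged per paper in the decision study
               (defaults to the number of raters observed).
    """
    n_p, n_r = len(ratings), len(ratings[0])
    if n_p < 2 or n_r < 2:
        raise ValueError("need at least two papers and two raters")
    k = n_r if n_ratings is None else n_ratings

    grand = sum(map(sum, ratings)) / (n_p * n_r)
    paper_means = [sum(row) / n_r for row in ratings]
    rater_means = [sum(row[j] for row in ratings) / n_p for j in range(n_r)]

    # Sums of squares for the two-way layout without replication.
    ss_p = n_r * sum((m - grand) ** 2 for m in paper_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_res = ss_total - ss_p - ss_r

    ms_p = ss_p / (n_p - 1)
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    # Estimated paper (universe-score) variance, truncated at zero.
    var_p = max((ms_p - ms_res) / n_r, 0.0)
    if var_p == 0.0:
        return 0.0
    return var_p / (var_p + ms_res / k)


# Hypothetical scores: four papers, each rated by the same two raters.
scores = [[1, 2], [2, 2], [3, 4], [4, 4]]
print(round(g_coefficient(scores), 3))               # 0.941 for two ratings
print(round(g_coefficient(scores, n_ratings=1), 3))  # 0.889 for one rating
```

Because the coefficient rises with the number of ratings averaged, the `n_ratings` argument matters when a pilot based on four raters is set beside final scores based on two ratings per paper. Whether the coefficients in Table 1 were adjusted in this way is not stated.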

Table 1

Comparison of Generalizability Coefficients for Rater Agreement
Immediately After Training and After Final Ratings

Study 1 - Expository Scale I (Spooner-Smith, 1978)

                               F     Dev   O     Su    Pa    M      Total
Pilot - 4 raters, n = 15       .94   .92   .94   .83   .94   .80    .90
Final - 2 ratings, n = 112     .84   .80   .85   .85   [?]   .9[?]  .90

Study 2 - Expository Scale II (Quellmalz and Capell, 1979)

                               GI    F     O     S      M     Total
Pilot - 4 raters               .74   .63   .74   .7[?]  .73
Final - 2 ratings              .67   .59   .61   .57    .52   .66

Study 2 - Narrative Scale II

                               GI    F     O     S      M     Total
Pilot - 4 raters               .86   .76   .79   .76    .52
Final - 2 ratings              .84   .60   .72   .72    .69   .83

Study 3 - Expository Scale III (Baker and Quellmalz, 1980)

                               GI    Gen Comp   Coh   Po    Su    M     Total
Pilot - 3 raters               .74   .65        .86   .93   .84   .71   .89
Final - 2 ratings              .66   .71        .62   .83   .71   .76   .81

Study 3 - Narrative Scale III

                               GI    Gen Comp   Coh   Po    Su    M     Total
Pilot - 3 raters               .83   .75        .62   .87   .54   .85   .79
Final - 2 ratings              .70   .76        .53   .87   .67   .68   .81

KEY

All entries are generalizability coefficients (GC).
[?] = digit illegible in the source copy.

Study 1 (Spooner-Smith, 1978)
  F        = Focus
  Dev      = Development
  O        = Organization
  Su       = Support
  Pa       = Paragraphing
  M        = Mechanics
  Total    = Total

Study 2 (Quellmalz and Capell, 1979)
  GI       = General Impression
  F        = Focus
  O        = Organization
  S        = Support
  M        = Mechanics
  Total    = Total

Study 3 (Baker and Quellmalz, 1980)
  GI       = General Impression
  Gen Comp = General Competency
  Coh      = Coherence
  Po       = Paragraph Organization
  Su       = Support
  M        = Mechanics
  Total    = Total

The first rating procedure was employed in Study 1, where Spooner-Smith
(1978) compared direct and indirect measures of writing competence. Four
raters received five hours' practice applying an analytic rubric,
Expository Scale I, to a set of papers representative of the experimental set.
The top panel of Table 1 presents Spooner-Smith's interrater reliabilities
for four raters on the pilot test conducted immediately after training and
on the final independent ratings of the experimental papers. During the final
independent scoring, raters read, rated and discussed discrepancies on
a common paper as a group approximately every hour to check adherence to
criteria. While the total score reliability on the final ratings remained
high, reliabilities of four of the six subscales dropped as much as .14,
indicating some degree of rater drift from original consensus levels.

The second rating procedure occurred in Study 2 (Quellmalz & Capell,
1979), which compared writing performance in different discourse and
response modes. Following scale training procedures employed by
Spooner-Smith (1978), pilot tests of interrater reliabilities for two revised
analytic rubrics, Expository Scale II and Narrative Scale II, checked
level of agreement of the four raters prior to final rating. Additional
training occurred on any subscale where the generalizability coefficient
was less than .70. During final scoring, rater pairs read and discussed
common papers after every 20 independent ratings. The two tables for
Study 2 indicate, again, that agreement levels on the total scores were
acceptably high, but that reliabilities on three of the expository
subscales deteriorated as much as .20. The interpretation of these data
was that the frequency and nature of the common check procedures were still
not curbing rater drift adequately.



Consequently, Study 3 implemented a revised rating procedure. Study
3 (Baker & Quellmalz, 1980) investigated the effect of modality of topic
presentation on eighth grade writing performance. Three raters
participated in scale training for analytic Expository Scale III and Narrative
Scale III. Following a pilot test of interrater reliability, the three
raters independently scored the experimental papers. Each paper received
two ratings. Common checks occurred every hour and were discussed by the
entire group.

As the two tables for Study 3 indicate, agreement levels fell on
General Impression, but not on the General Competency rating.
Reliabilities plummeted on the expository Coherence ratings and on the Mechanics
ratings of the narrative scale. These comparisons of pilot and final
reliabilities for Study 3 suggested that the revised checking procedure was
generally maintaining rater agreement, but still did not prevent drift on
some subscales.
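
The pilot-to-final comparison for Expository Scale III can be tabulated directly from the coefficients printed in Table 1. In the sketch below the function name and the .10 flagging threshold are assumptions chosen for illustration; the paper does not define a cut-off for "drift."

```python
# Expository Scale III coefficients as printed in Table 1 (Study 3).
pilot = {"GI": .74, "Gen Comp": .65, "Coh": .86, "Po": .93,
         "Su": .84, "M": .71, "Total": .89}
final = {"GI": .66, "Gen Comp": .71, "Coh": .62, "Po": .83,
         "Su": .71, "M": .76, "Total": .81}


def reliability_drops(pilot, final, threshold=0.10):
    """Return the subscales whose coefficient fell by at least `threshold`."""
    drops = {name: round(pilot[name] - final[name], 2) for name in pilot}
    return {name: d for name, d in drops.items() if d >= threshold}


print(reliability_drops(pilot, final))
# {'Coh': 0.24, 'Po': 0.1, 'Su': 0.13}
```

With this threshold the largest decline is on Coherence (.24), in line with the description above, while General Competency and Mechanics rose slightly and are not flagged.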

In a more detailed inspection of the emergence of rater drift in Study 3,
we also compared reliabilities and mean scores on papers scored early
and late in the rating sequence (see Table 2). Table 2 presents the early
vs. late comparisons for Expository Scale III and Narrative Scale III.
On the expository scale, reliabilities across all rater pairs remain high
(α .76 to .85) except on the General Impression and Coherence subscales.
Parametric comparisons of mean scores on early vs. late papers did not
reach statistical significance, but late scored papers received slightly
higher ratings than early scored papers.
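
An early-versus-late comparison of this kind can be sketched from summary statistics alone. The sketch assumes an equal-variance two-sample t test and 40 papers per group; the paper does not name the test it used, and the group sizes are hard to read in Table 2, so both are assumptions. The means and standard deviations are those printed for General Impression on Narrative Scale III.

```python
from math import sqrt


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Equal-variance two-sample t statistic from summary statistics."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se


# Narrative Scale III, General Impression: early vs. late scored papers.
# n = 40 per group is an assumption, not a value confirmed by the source.
t = pooled_t(2.62, 0.92, 40, 2.19, 0.73, 40)
print(round(t, 2))  # 2.32
```

Under these assumptions the statistic comes out close to the 2.31 printed for that row of Table 2, which suggests, but does not confirm, that a test of this form was used.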

Reliabilities on Narrative Scale III remained high on General Compe- 



TABLE 2

Comparison of Early vs. Late Scored Papers in Study 3
(Baker and Quellmalz, 1980)

Expository Scale III

                          Inter-rater
                          Reliabilities (alpha)   Early papers      Late papers
Subscale                  Early     Late          Mean     S.D.     Mean     S.D.        t

General Impression         .85      .69           2.28     1.07     2.29      .85       .07
General Competency         .75      .77           2.20      .91     2.43      .86       .23
Coherence                  .78      .57           2.39      .88     2.63      .90       .21
Paragraph Organization     .87      .86           2.03     1.05     2.22     1.08       .40
Support                    .78      .76           2.99      .85     3.11      .90       .51
Mechanics                  .67      .82           2.18      .85     2.99      .76     -1.08
Total                      .87      .85          14.78     4.86    15.89     4.49     -1.06

n = 40

Narrative Scale III

                          Inter-rater
                          Reliabilities (alpha)   Early papers      Late papers
Subscale                  Early     Late          Mean     S.D.     Mean     S.D.        t

General Impression         .78      .71           2.62      .92     2.19      .73      2.31
General Competence         .81      .78           2.54      .87     2.20      .78      1.84
Coherence                  .77      .46           2.60      .99     2.31      .59      1.60
Paragraph Organization     .93      .85           2.22     1.29     2.03     1.00       .74
Support                    .84      .84           2.82      .97     2.51      .68      1.68
Mechanics                  .68      .80           2.30      .80     2.74  [illegible]   .82
Total                      .90      .86          14.35     4.94  [illegible] [illegible] 1.49

n                           40       50             40


tence, Support, Mechanics and Total score. General Impression reliability 
dropped .08, Coherence dropped substantially (alpha .77 to .46) and Paragraph Organization fell (alpha .93 to .85). Contrasts of mean differences between early and late scored narrative papers revealed a significant difference on General Impression ratings. Papers scored later received lower ratings than those scored earlier. All subscale scores were lower for late scored papers. These findings are consistent with other research (Godshalk et al., 1966) that reported raters became more severe as scoring progressed. In Study 3, Expository papers were scored before Narrative papers, so late scored Narrative papers were at the very end of the entire scoring sequence.

Inspection of the scoring data from the three studies suggests that rater drift within a scoring session can occur and weaken scoring rigor. Raters' judgments wavered on some subscales more than others, signalling a need for more careful explication of criteria on those subscales and practice in their application. Since state-of-the-art procedures for controlling rater drift were employed and even refined in these studies, the data implied the need to continue to examine methodologies for detecting and preventing rater drift.

Scale Stability

A validity concern coordinate with maintenance of scale fidelity within a rating occasion is assurance of judgment accuracy across rating occasions. Standards of fairness and methodological rigor mandate that criteria apply uniformly across sets of raters and sets of papers.

Prevailing practice does not seem to recognize stability as a technical problem. Large scale assessments do not routinely report and inspect a series of rater reliabilities for separate scoring sessions. Even reliability indices are not sufficient, however. Comparisons of mean scores on common papers should supplement reliability statistics. Scale stability could be demonstrated by comparing scores on a common set of papers given by different rater sets trained separately, or by comparing scores from the same raters rating at different occasions. While we have not yet investigated this phenomenon within an experimental paradigm, we have inspected scoring data gathered during the process of our other writing assessment research in an attempt to understand the nature of variables influencing scale stability.


Table 3 presents the means and standard deviations of essay scores given by two different rater sets to the same papers. Raters A and B scored 30 expository essays. Rater pairs 1, 2 and 3 rated these same 30 essays in the course of Study 3. Rater pairs 1, 2 and 3 were using Expository Scale III, a revision of the analytic expository rating scale used by Raters A and B. Therefore only scores from those subscales that were not significantly changed were entered into the analysis. Agreement levels were not calculated due to the small sample size.


Inspection of the means reveals that Raters A and B gave generally higher ratings than Rater pairs 1, 2 and 3. Comparisons of means for each subscale and the total score were all significant. While the small number of papers clearly limits interpretation of these data, they do document that criteria definition and application did change from one rating session to the next.



Table 3

Comparison of Essay Scores* Given by Different Rater Sets
on Separate Occasions

                                          Ratings
                                Occasion 1         Occasion 2
Subscales                       Raters A and B     Rater pairs 1-6       t        df

General Competence      X           2.92               2.65            2.77*      57
                        s.d.         .62                .38
                        n            29                 30

Paragraph Organization  X           2.19               1.46            3.67*      58
                        s.d.         .98                .50
                        n            29                 31

Support                 X           2.76               2.07            3.25*      59
                        s.d.        1.08                .50
                        n            29                 32

Total                   X          11.81               8.97            4.38*      59
                        s.d.        3.17               1.76
                        n            29                 32

*Scores by rater pairs 1-6 were transformed from a score range of 1-6 to 1-4 to permit analyses.
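
The t values in Table 3 can be checked from the reported means, standard deviations and sample sizes. The Python sketch below assumes an independent-samples t test with a pooled variance, which is consistent with the reported degrees of freedom (n1 + n2 - 2) but is not stated in the text; the function name is illustrative.

```python
from math import sqrt


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Independent-samples t statistic and df from summary statistics."""
    df = n1 + n2 - 2
    # Pooled variance weights each group's variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    std_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_error, df


# Summary statistics as printed in Table 3 (Occasion 1, then Occasion 2).
table3 = {
    "Paragraph Organization": (2.19, 0.98, 29, 1.46, 0.50, 31),
    "Support": (2.76, 1.08, 29, 2.07, 0.50, 32),
    "Total": (11.81, 3.17, 29, 8.97, 1.76, 32),
}

results = {name: pooled_t(*stats) for name, stats in table3.items()}
for name, (t, df) in results.items():
    print(f"{name}: t = {t:.2f}, df = {df}")
```

Under this assumption the sketch gives values that agree, to two decimals, with the tabled t statistics of 3.67, 3.25 and 4.38 on 58, 59 and 59 degrees of freedom.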




In addition to looking at the scores different raters trained at separate occasions gave to the same set of papers, we inspected intra-rater agreement of scores a pair of raters gave to common papers scored at different sessions. Table 4 displays means and standard deviations of a rater pair (N) which participated in two different rating sessions.



Insert Table 4 here 

In Study 1, rater pairs M and N scored essays from a general high school population which were then "salted in" a set of college admission essays read for Study 2. In Study 2, pair N read the eight essays they had scored previously in Study 1 and eight additional essays from that study that they had not personally scored. The means of pair N in the two studies are fairly comparable except on Support and Mechanics. In contrast, the means of pairs M and O are substantially different. Pair O means are consistently lower. The greater stability of means for pair N may suggest that they were applying criteria in a uniform manner. Pair O was probably influenced by the overall higher quality of the college admissions sample, thus making the "salted in" general population high school essays seem worse. Methods for eliminating this subtle "norming" of presumably explicit criteria to the quality range of a particular sample require further research.

Our intent in inspecting these admittedly limited data was to illustrate one method for tracking the stability of rating scale application. Writing assessments could systematically include a "check" set of papers in each rating session to document the comparability of judges' decision



TABLE 4

Comparison of Rater Pair Scores Across Studies

                                  Study 1              Study 2
                                 Rater Pair           Rater Pair
CSE Subscale                    M         N          N         O

General Impression    X        1.92      1.28       1.00       .94
                      s.d.     1.32      1.37       1.13       .91
                      n         6         8         16        16

Focus                 X        2.08      1.71       1.69      1.53
                      s.d.      .38       .9         .48       .50
                      n         6         8         16        16

Organization          X        2.33      1.65       1.72      1.38
                      s.d.      .98       .6         .86       .50
                      n         6         8         16        16

Support               X        2.42      2.76       2.00      1.63
                      s.d.      .92      1.15       1.78       .50
                      n         6         8         16        16

Mechanics             X        2.50      2.20       1.78      1.75
                      s.d.      .84       .70        .77       .58
                      n         6         8         16        16

Total                 X       11.25      9.60       8.19      7.21
                      s.d.     3.71      2.79       3.40      2.53
                      n         6         8         16        16




rules at different rating sessions. We believe that scale stability 
across topics, quality range of papers and sets of raters can be achieved 
and that the factors influencing scale stability require systematic in- 
vestigation. 
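
The "check" set idea can be sketched as a simple comparison of subscale means for the same rater pair across sessions. The Python example below uses the pair N means from Table 4; the 0.40 tolerance and the function name are assumptions chosen for illustration, not values proposed in this report.

```python
def flag_unstable_subscales(session_a, session_b, tolerance):
    """Return subscales whose mean shifted by more than `tolerance`."""
    return [
        name
        for name in session_a
        if abs(session_a[name] - session_b[name]) > tolerance
    ]


# Pair N subscale means from Table 4.
pair_n_study1 = {
    "General Impression": 1.28,
    "Focus": 1.71,
    "Organization": 1.65,
    "Support": 2.76,
    "Mechanics": 2.20,
}
pair_n_study2 = {
    "General Impression": 1.00,
    "Focus": 1.69,
    "Organization": 1.72,
    "Support": 2.00,
    "Mechanics": 1.78,
}

# Hypothetical tolerance: how large a shift in means is worth a second look.
flagged = flag_unstable_subscales(pair_n_study1, pair_n_study2, tolerance=0.40)
print(flagged)
```

With this tolerance the sketch flags Support and Mechanics, the two subscales the text singles out for pair N. A working procedure would also need to take the number of papers and the spread of scores into account.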

Summary and Recommendations

The need for stabilizing the scoring process is critical to the validity of writing assessments. Direct evidence of student writing competence, actual written production, is a necessary condition for content and construct validity; it is not sufficient, however. Raters' judgments must be replicable and defensible. We believe that explicit rating criteria are a condition for defensibility and replicability. Our rater drift comparisons suggest that total scores and a holistic score seem to mask fluctuations in judgments on the elements that contribute to the more global summary scores. We suspect that, at least during scale development and validation, assessments should collect separate ratings on component text features such as Support and Coherence that contribute to a total score. Otherwise, there is no way to identify and track consistency of the bases for global judgments.

Certainly, scale training and an initial reliability check are essential. Rather than relying primarily on randomization or statistical procedures to correct for rater drift post hoc, rating methods should intersperse periodic checks into lengthy, independent scoring. The variables making these checks effective for maintaining agreement and scale fidelity require further investigation. Frequency of checks is one important factor; the nature of feedback on scoring accuracy is even more essential. We are currently conducting research on methods for curbing rater drift.

Scale stability is a critical validity issue for competency-based writing assessment. Large scale assessments can, at least, document stability by tracking scoring of a core set of papers by different groups of raters. Methodologies for detecting and preventing scale instability should also receive direct experimental attention. Fair, informative, generalizable, defensible scoring procedures are necessary requirements of sound writing assessment.



References



Baker, E. L., & Quellmalz, E. S. Issues in eliciting writing performance: Problems in alternative prompting strategies. Paper presented at the annual meeting of the National Council on Measurement in Education, Boston, April 1980.

Braddock, R., Lloyd-Jones, R., & Schoer, J. Research in written composition. Urbana, Ill.: National Council of Teachers of English, 1963.

Coffman, W. E. Essay examinations. In R. L. Thorndike (Ed.), Educational Measurement (2nd ed.). Washington, D.C.: American Council on Education, 1971a.

Coffman, W. E. On the reliability of ratings of essay examinations in English. Research in the Teaching of English, Vol. 5(1), Spring 1971b.

Conlan, G. How the essay in the CEEB English test is scored. Princeton, N.J.: Educational Testing Service, 1976.

Crowhurst, M. Syntactic complexity in narration and argument at three grade levels. Canadian Journal of Education, 1980.

Diederich, P. B., French, J. W., & Carlton, S. Factors in judgments of writing ability. Princeton, New Jersey: Educational Testing Service, 1961.

Diederich, P. B. Measuring growth in English. Urbana, Ill.: National Council of Teachers of English, 1974.

Eels, W. C. Reliability of repeated grading of essay-type examinations. Journal of Educational Psychology, 1930, 21.

Findlayson, D. S. The reliability of the marking of essays. British Journal of Educational Psychology, 1951, 21, 126-134.

Follman, J. C., & Anderson, J. A. An investigation of the reliability of five procedures for grading English themes. Research in the Teaching of English, 1967, 190-200.

Freedman, S. How characteristics of student essays influence teachers' evaluation. Journal of Educational Psychology, 1978, 70.

Godshalk, F. E., Swineford, F., & Coffman, W. E. The measurement of writing ability. New York: College Entrance Examination Board, 1966.

Kinneavy, J. R. A theory of discourse. In the Aims of Discourse. Englewood Cliffs, N.J.: Prentice-Hall, Inc., 1971.

McColly, W. What does educational research say about the judging of writing ability? The Journal of Educational Research, 64, No. 4, December 1970.

Office of the Los Angeles County Superintendent of Schools. A common ground for assessing competencies in written expression, review copy. Los Angeles: Division of Curriculum and Instructional Services, 1977.

Pitts, M. The relationship of classroom instructional characteristics and writing in the descriptive/narrative mode. Report to the National Institute of Education. Los Angeles: UCLA Center for the Study of Evaluation, 1978. (Grant No. OB-NIE-G-78-0213)

Prater, D., & Padia, W. Effects of modes of discourse in writing performance in grades four and six. Paper presented at the annual meeting of the American Educational Research Association, Boston, 1980.

Quellmalz, E. S. Interim report. Defining writing domains: Effects of discourse and response mode. Center for the Study of Evaluation, University of California, Los Angeles, 1979.

Quellmalz, E. S., & Capell, F. Defining writing domains: Effects of discourse and response mode. Report to the National Institute of Education, November, 1979. (Grant No. OB-NIE-G-78-0213 to the Center for the Study of Evaluation)

Spooner-Smith, L. Investigation of writing assessment strategies. Report to the National Institute of Education, November 1978. (Grant No. OB-NIE-G-78-0213 to the Center for the Study of Evaluation.)

Stalnaker, J. The construction and results of a twelve-hour test in English composition. School and Society, 1934, 39.

Winters, L. The effects of differing response criteria on the assessment of writing competence. Grant No. OB-NIE-G-78-0213, Los Angeles, California: Center for the Study of Evaluation, November 1978.






EFFECTS OF VISUAL OR WRITTEN TOPIC INFORMATION
ON ESSAY QUALITY



Eva L. Baker and Edys Quellmalz 



Center for the Study of Evaluation 

Graduate School of Education 
University of California, Los Angeles 



The project presented or reported herein was performed
pursuant to a grant from the National Institute of Education,
Department of Education. However, the opinions
expressed herein do not necessarily reflect the position
or policy of the National Institute of Education, and
no official endorsement by the National Institute of
Education should be inferred.



When one considers the critical properties a test must have, the element of validity is almost always discussed. Validity means the test measures what it is supposed to, and, as an essential corollary, makes that measurement fairly. Fairness or equity involves the kind of chances students have to demonstrate their competency. In some cases, test equity is also inferred from the shape of the resulting distribution of scores for different student groups. The task of writing assessment is a particular and special subset of the testing process. Writing assessment contains three special features that increase the importance of its research: 1) writing skill is recognized by the public and by educators as a critical goal of schooling; 2) the study of writing provides access to the study of cognitive processes; 3) writing assessment serves as a case from the larger set of constructed responses in achievement testing, a set to which relatively little psychometric theory has been applied.

Although the use of actual student writing samples seems the most obvious way to obtain estimates of student performances, it is loaded with practical complexities. Unlike multiple choice tests, which appear to be fixed artifacts, essay-based assessment appears less constrained in the writing tasks themselves, particularly in the actual directions or prompts given to learners and the criteria used to score performance. Of course, differences in either of these dimensions greatly affect the inferences we make. Were writing assessment to remain the domain of individual teachers, as they privately teach and assess, we would expect that idiosyncrasies of tasks and scoring schemes in particular classrooms would be balanced over time by the number of different teachers to whom any given student was exposed. Yet, writing assessment assumes public rather than private proportions, as exemplified by competency testing for high school graduation and statewide assessment programs. The public functions of writing assessment bring with them the necessity to develop and to display to students and teachers the specifications guiding the preparation of writing tasks.

From a research perspective, moreover, writing assessment presents a special opportunity to understand the learning process. By studying information requirements and the effects of cues and supports given students, the technical aspects of assessment can be improved and desirable features identified for inclusion in writing instruction. Writing, as much as any school-trained activity, shows us how students think, how they organize information, and how they understand subject matter.

Our work on writing assessment related to a general assessment framework. This framework includes elements relating to 1) social and intellectual motivation; 2) student ability and information; 3) features of tasks; 4) criteria used for scoring. (See Figure 1) Although these categories could be elaborated almost infinitely, our research focuses on elements endogenous to the writing tasks, such as cues, modes of discourse, and information.

Insert Figure 1 here

Tasks for Writing Assessment 

The first problem of writing assessment is the selection of a topic, 

[Figure 1. Conceptual Framework: Test Design. Components shown: Antecedent Conditions; Task Elicitation Conditions; Analysis Techniques; Reporting; Information System Linking Testing and Instruction.]


direction, or prompt to elicit student composition. It is common to give assignments on topics that students have some information about but which avoid systematically advantaging particular students with specialized knowledge. This "nodding acquaintance" mode of topic identification results in bland, general topics, such as "My Street," topics unlikely to generate real enthusiasm and meaning for students. Nonetheless, particular students are thought to be about equally prepared on such topics. In addition, these inoffensive assignments will not disturb parents or school administrators. The "general" experience tapped by such tasks corresponds to the "general frames" described by Minsky (1975) and Anderson (1972) in research on the cognitive aspects of reading comprehension. These frames consist essentially of certain general information and common referents prerequisite to the ability to use special tactics to generate responses. In particular, the reliance on common frames or referent "general experience" becomes more and more risky as students come from more diverse backgrounds. With less shared experience among students, writing tasks themselves may need to provide information for students to write about.

We are led to the simple question of whether possession of specific information affects students' ability to write. In other words, can students demonstrate their writing ability when they have little specific information to convey? Or, is writing fluency independent of specific information and related more to the general or common referent information? Most importantly, how can students be supported so that they can display their writing competency at its highest level?

To explore these questions, we need to provide specific information to students so that they may have "content" for composing. How can one go about this task? English composition teachers have attempted to provide sufficient information to learners to enable them to write knowledgeably and productively. Such efforts most often take the form of an extended set of written directions which sets the context and audience and provides a brief discussion of the purpose of the essay. One limitation of this form of prompt, the "extended written passage," is the reading comprehension load placed upon the would-be writer, a particular disadvantage for a large number of poor readers. Another consequence frequently noted by teachers is imitation. Students may mimic the form and style of the extended prompt itself.* The strongly styled prose instructions may jeopardize students' writing and, consequently, the accuracy of the assessment effort.


Because pictures convey information that is both general and specific, they have been used formally in some assessment settings as a way around excessive reading burden. In fact, Levin and Lesgold (1978) describe the facilitation properties of pictures in reading comprehension tasks. If pictures enhance comprehension differentially for poor comprehenders, then the use of picture prompts as substitutions or elaborations for prose directions should positively affect writing performance, particularly for students with poorer writing skills.

*The phenomenon is much like a tendency to write short, zippy declarative sentences after reading Hemingway, convoluted and sentimental passages after reading Dickens, or to forego the rules of capitalization after confronting e. e. cummings.

Overview

The study represents a start in unraveling the relationships between information in writing tasks and writing performance. In this research, the information provided in picture prompts for eighth-grade writing tasks, which were varied in two modes of discourse, exposition and narrative, was compared with written prompts. Although pictures tend to be associated with narrative or descriptive writing, the frequency of and generally dismal performance on expository tasks led us to test picture prompts in expository tasks as well as descriptive writing. An experiment was conducted where eighth grade students were randomly assigned to receive either a picture or written prompt and either an expository or narrative writing task. Students also completed a test of reading ability. Results were assessed using both general and analytic scoring schemes and were examined for students of different reading achievement levels.

Subjects and Sampling 

Students were sampled as part of an evaluation of eighth grade achievement in a study of California educational reform. The eighth grade level was chosen because, at that grade, students often write both narrative and expository compositions. Eighteen schools were sampled to represent a scheme stratified by school size, percentage of low or non-English speaking students, California geography, and socioeconomic status (SES). Two heterogeneously grouped classrooms in each school at the eighth grade were randomly selected.


Instrumentation

Writing Task. Because the different information conveyed by picture and prose is a matter of aphorism, the study made no attempt to equate concrete or general bits of information in these two stimulus classes. Within each classroom, students received one of four treatments:
1) picture prompt/directions for exposition (PE); 2) picture prompt/directions for narration (PN); 3) written prompt/directions for exposition (WE); 4) written prompt/directions for narration (WN).

Reading Test. The reading test consisted of 68 items and was composed of three subscales: vocabulary (21 items), literal comprehension (24 items), and inferential comprehension (23 items). The tests were used to assess reading skills in a statewide evaluation study. Tests were generated using domain-referenced testing procedures (see Hively, 1974; Herman, 1977; Baker, 1977). Spearman-Brown coefficients were computed, and coefficients of .76, .80 and .72 were obtained, respectively, for the three subscales. The Spearman-Brown coefficient for the total 68 item test was .92. The difficulty by subscale was .64 for vocabulary, .65 for literal comprehension, and .64 for the inferential comprehension subscale.
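
For readers unfamiliar with the coefficient, the general Spearman-Brown formula predicts the reliability of a test lengthened by a factor k from the reliability r of the original: r_kk = k r / (1 + (k - 1) r). The Python sketch below illustrates the formula only; how the coefficients above were actually computed (for example, from split halves) is not described in the text, and the input value is hypothetical.

```python
def spearman_brown(reliability, length_factor=2.0):
    """Predicted reliability when a test is lengthened by `length_factor`."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical half-test correlation stepped up to full-test length.
half_test_r = 0.50
full_test_r = spearman_brown(half_test_r)
print(round(full_test_r, 3))
```

The same formula also shows why a 68 item total can be more reliable than any of its shorter subscales: other things being equal, predicted reliability rises as the length factor grows.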

Scoring Systems for Dependent Measures. In previous studies (Winters, 1978; Spooner-Smith, 1978), the utility of alternative scoring strategies, e.g., holistic and analytic procedures, was assessed using high school and college populations. Since this sample population was younger, a pilot study of scoring procedures was conducted using thirty papers selected at random from the entire sample. On the basis of this pilot study (Baker & Quellmalz, 1979), the use of T-units as a dependent measure was excluded for this research. A general impression (GI) or holistic assessment was given to each essay, using a six-point scale in the pilot study. In addition, an analytic scoring rubric was applied to each essay, based on schema previously employed in other writing research (Pitts, 1978; Quellmalz, 1979; Winters, Spooner-Smith, op. cit.). This rubric consists of four subscales (of six-point range) on structural features of the essay: 1) paragraph organization, 2) coherence, 3) support and 4) mechanics. These subscales are combined into a General Competency scale. Previous work (Quellmalz & Capell, 1979) identified the relative independence of the scales.

Socio-economic Indicators. The State of California has no direct index of SES for its secondary school populations. Instead, the percent of students receiving Aid to Families with Dependent Children (AFDC) is used as a proxy.

Procedures 

Teachers were directed to administer the test of reading comprehension on the first day of testing during the spring. On the second day, students received a writing task to complete in forty minutes. The four treatments were randomly assigned within classroom. Teachers returned coded student response booklets to the researchers by pre-paid mail.

Rater Training. Three pairs of raters were trained to use this rubric on expository and narrative sample papers.* Six hours were spent training and practice scoring in the expository mode until an acceptable level of agreement was reached (alpha = .83, generalizability coefficient = .89). All expository papers were then rated in a group. During a subsequent session, practice was provided in applying the rubric to narrative papers until acceptable concordance was reached (alpha = .83; generalizability coefficient = .79). Raters first gave each paper a general impression rating, which included estimates of the style, creativity, structural and mechanical features of the paper. Next, they gave a general competency score with regard to the subscales: coherence, support, paragraphing, and mechanics. Last, the raters scored each subscale. All scores were assigned from a "1" to "6" scale.

*Student paper length varied from one-half page to two pages. Student papers were not retyped for scoring.

Analysis 

Data were returned from thirteen of the eighteen schools. Because of constraints about information for individual students in this sample, only school level data were available regarding SES and language dominance. According to a comparison of the sample means with the statewide means, the attrition (the loss of five schools) occurred in those schools with lower SES and higher percentages of low English speakers. The distribution of AFDC in our total sample was 10% and in the returned sample 6%, indicating that the drop-out took place in low SES schools. In addition, the percentage of low English speaking students was reported as less than one percent for our sample, much lower than for the entire California population. Clearly, the lower achieving side of the sample did not return the measures. Although the school level/student level SES data problem was clear, and we had hoped that some indication of SES might assist us in our analysis, we found that our test of reading achievement correlated .62 with AFDC. This correlation corresponds to relationships found over the last few years of studies of achievement and SES in California elementary schools (Baker, 1976; Baker & Herman, 1977; Baker, Herman & Yeh, 1978; Quellmalz & Baker, 1981).



Experimental Contrasts. Students' writing performance was assessed using a 2x2x2 design, employing two levels of reading scores (high and low, split at the mean), two types of prompt, and two types of discourse task. Data were separately analyzed for the General Impression scale (GI), for the General Competency scale, for the four subscales of the analytic rubric, and for the total. Means and standard deviations for each variable were computed and multiple classification analyses of variance were performed.
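
The blocking step can be sketched as follows. The Python example splits students at the mean reading score and tabulates cell means for the resulting 2x2x2 layout; the records and field order are hypothetical, and the sketch stops short of the analysis of variance itself.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (reading score, prompt type, discourse mode, essay score).
students = [
    (60, "picture", "expository", 3),
    (40, "picture", "expository", 2),
    (55, "picture", "narrative", 3),
    (35, "picture", "narrative", 2),
    (62, "written", "expository", 3),
    (38, "written", "expository", 1),
    (58, "written", "narrative", 2),
    (32, "written", "narrative", 2),
]

# Block on reading: "high" if above the sample mean, otherwise "low".
reading_mean = mean(score for score, _, _, _ in students)

cells = defaultdict(list)
for reading, prompt, mode, essay in students:
    level = "high" if reading > reading_mean else "low"
    cells[(level, prompt, mode)].append(essay)

# One mean per (reading level, prompt type, discourse mode) cell.
cell_means = {cell: mean(scores) for cell, scores in cells.items()}
for cell in sorted(cell_means):
    print(cell, cell_means[cell], len(cells[cell]))
```

Splitting at the mean, as here, generally yields unequal cell sizes, which is one reason the tables that follow report n for every cell.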
Results

Means and standard deviations of the dependent measures by blocking factors are presented in Tables 1 through 24. Because of the large number of analyses conducted, some significant findings would be detected by chance alone. Only those findings which show up consistently across measures will be discussed.
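
The caution about chance findings can be made concrete with a small calculation. The Python sketch below assumes independent tests at a common significance level, which is a simplification, and uses a hypothetical count of tests; the report does not state how many tests were run.

```python
def expected_false_positives(num_tests, alpha=0.05):
    """Expected number of significant results if every null hypothesis is true."""
    return num_tests * alpha


def prob_at_least_one(num_tests, alpha=0.05):
    """Chance of one or more false positives across independent tests."""
    return 1 - (1 - alpha) ** num_tests


num_tests = 20  # hypothetical number of F tests
print(expected_false_positives(num_tests))
print(round(prob_at_least_one(num_tests), 3))
```

Requiring that a finding recur across several dependent measures, as the authors do, is an informal guard against this kind of inflation.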

Overall Findings. The salient feature of the tables is the relatively poor performance of students in writing competence. Raters were instructed to use "4" as a score of sufficient competency or, oxymoronically, minimum mastery. No average, either for high scoring readers or for any treatment variation, is 4 or higher. This finding is particularly depressing in the light of the drop-out analysis and the inference that schools with poorer performing students did not return the measures for scoring.

Holistic Scoring. Two forms of holistic scoring were used: the General Impression (GI) scale, which included style and other "intangibles" such as creativity, and the General Competency score, an estimate of the total "competency" of the paper with regard to the features of the analytic scale. Recall that these two scores were given before detailed analytic scoring took place. Using the total reading score as a blocking factor,

Insert Tables 1-8 here 

a significant two-way interaction was found on GI for mode of discourse and reading ability (F = 4.17, df = 1,212, p = .04). Inspection of Table 1 suggests that performance by the low readers on narrative tasks was particularly poor. With regard to the General Competency score, mode of discourse and prompt form significantly interacted (F = 4.50, df = 1,212, p = .04) with the inferential comprehension blocking factor (see Table [illegible]) and missed by a little (F = 2.58, df = 3,212, p = .055) on all other blocking runs. A speculative interpretation is that written expository and picture narrative combinations are most facilitative on General Competency. Inspection of Tables 1-9 suggests that these effects were true for the better readers, a speculation supported by the three-way interaction in Table 7 (F = 4.08, df = 1,212, p = .045). The findings for the Coherence dependent measure (Tables 9-12) display a significant main effect for prompt type (F = 6.12, df = 1,212, p = .01) in favor of pictures. This finding is also replicated on the vocabulary subscales (F = 5.92, df = 1,212, p = .016). Table 11 provides an interesting case where the poor readers outperform the good readers under the picture-expository condition by about .5 standard deviation, outperform low reading groups without pictures by a greater margin, and about equal the high reading groups who do not have

Insert Tables 9-12 here 





General Impression

Table 1
Total Reading

                              Picture                   Written
                       Expository    Narrative   Expository    Narrative
High Total      X            2.36         2.92         2.51         2.53
Reading         sd           0.80         0.89         1.07         0.61
                n              33           30           35           29
Low Total       X            2.18         2.06         2.12         1.94
Reading         sd           0.61         0.47         1.08         0.63
                n              19           24           20           23










Table 2
Inferential Comprehension

                              Picture                   Written
                       Expository    Narrative   Expository    Narrative
High Total      X            2.34         2.83         2.55         2.47
Inferential     sd           0.81         0.89         1.05         0.61
Comprehension   n              28           30           30           33
Low Total       X            2.25         2.17         2.20         1.92
Inferential     sd           0.66         0.62         1.10         0.67
Comprehension   n              24           24           25           19



Table 3
Comprehension

                              Picture                   Written
                       Expository    Narrative   Expository    Narrative
High Total      X            2.34         2.85         2.59         2.46
Comprehension   sd           0.79         0.89         1.11         0.64
                n              28           26           29           25
Low Total       X            2.25         2.25         2.17         2.09
Comprehension   sd           0.68         0.70         1.02         0.68
                n              24           28           26           27



Table 4
Vocabulary

                              Picture                   Written
                       Expository    Narrative   Expository    Narrative
High Total      X            2.44        2.870         2.79         2.42
Vocabulary      sd           0.79        0.967         1.17         0.58
                n              23           27           24           26
Low Total       X            2.19         2.20         2.08         2.12
Vocabulary      sd           0.69         0.54         0.91         0.75
                n              29           27           31           26



General Competency

Table 5
Total Reading

                              Picture                   Written
                       Expository    Narrative   Expository    Narrative
High Total      X            2.39         2.88         2.59         2.47
Reading         sd           0.66         0.88         1.03         0.61
                n              33           30           35           29
Low Total       X            2.13         1.96         2.23         1.94
Reading         sd           0.62         0.49         0.91         0.76
                n              19           24           20           23

Table 6
Inferential Comprehension

                              Picture                   Written
                       Expository    Narrative   Expository    Narrative
High Total      X            2.32         2.75         2.63         2.44
Inferential     sd           0.63         0.91         1.04         0.62
Comprehension   n              28           30           30           33
Low Total       X            2.27         2.13         2.24         1.87
Inferential     sd           0.69         0.66         0.90         0.76
Comprehension   n              24           24           25           19





Table 7
Comprehension

                              Picture                   Written
                       Expository    Narrative   Expository    Narrative
High Total      X            2.67         2.79         2.72         2.42
Comprehension   sd           0.63         0.85         1.08         0.69
                n              28           26           29           25
Low Total       X            2.33         2.18         2.15         2.06
Comprehension   sd           0.69         0.77         0.80         0.73
                n              24           28           26           27



Table 8
Vocabulary

                              Picture                   Written
                       Expository    Narrative   Expository    Narrative
High Total      X            2.46        2.815         2.79         2.44
Vocabulary      sd           0.67         0.97         1.05         0.61
                n              23           27           24           26
Low Total       X            2.17         2.13         2.19         2.02
Vocabulary      sd    (illegible)         0.57         0.87         0.78
                n              29           27           31           26






Coherence

Table 9
Total Reading

                              Picture                   Written
                       Expository    Narrative   Expository    Narrative
High Total      X            2.59         3.02         2.59         2.43
Reading         sd           0.70         0.98         0.97         0.55
                n              33           30           35           29
Low Total       X            2.66         2.33         2.35         2.13
Reading         sd           0.55         0.58         1.09         0.51
                n              19           24           20           23



Table 10
Inferential Comprehension

                              Picture                   Written
                       Expository    Narrative   Expository    Narrative
High Total      X            2.52         2.95         2.62         2.46
Inferential     sd           0.73         0.99         1.02         0.51
Comprehension   n              28           30           30           33
Low Total       X            2.73         2.42         2.36         2.03
Inferential     sd           0.53         0.64         1.00         0.51
Comprehension   n              24           24           25           19











Table 11
Comprehension

                              Picture                   Written
                       Expository    Narrative   Expository    Narrative
High Total      X            2.48         3.04         2.72         2.40
Comprehension   sd           0.71         0.87         0.96         0.48
                n              28           26           23           25
Low Total       X            2.77         2.41         2.25         2.20
Comprehension   sd           0.54         0.80         1.02         0.59
                n              24           28           26           27

Table 12
Vocabulary

                              Picture                   Written
                       Expository    Narrative   Expository    Narrative
High Total      X            2.70         2.96         2.69         2.31
Vocabulary      sd           0.75         0.94         1.07         0.45
                n              23           27           24           26
Low Total       X            2.55         2.46         2.36         2.29
Vocabulary      sd           0.56         0.77         0.95         0.64
                n              29           27           31           26



pictures. Obviously the data are only exploratory, but such findings,
if replicated, would suggest a compensatory role for picture-stimulated
expository writing. The three-way interaction, significant beyond the
.01 level (F = 7.87, df = 1,212), suggests a disordinal relationship
where pictures facilitate the high readers' narrative production and
affect the low readers' expository performance. A mode of discourse by
inferential comprehension interaction is also significant (F = 4.50,
df = 1,212, p = .035) and suggests that poor readers have less success
with narrative tasks. (See Table 10.) No differences were detected on
the paragraphing subscale.
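The "about .5 standard deviation" comparison drawn from Table 11 can be
reproduced from the printed cell statistics. The sketch below is
illustrative only: it assumes a pooled standard deviation as the
standardizer (the report does not say which standardizer was used), and the
function and variable names are hypothetical.

```python
import math


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two independent groups."""
    return math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))


def standardized_difference(mean1, sd1, n1, mean2, sd2, n2):
    """Mean difference (group 1 minus group 2) in pooled-sd units."""
    return (mean1 - mean2) / pooled_sd(sd1, n1, sd2, n2)


# Table 11 (Coherence, comprehension blocking), picture-expository cells,
# given as (mean, sd, n) and transcribed from the scan.
LOW_PICTURE_EXPOSITORY = (2.77, 0.54, 24)
HIGH_PICTURE_EXPOSITORY = (2.48, 0.71, 28)

if __name__ == "__main__":
    d = standardized_difference(*LOW_PICTURE_EXPOSITORY, *HIGH_PICTURE_EXPOSITORY)
    print(f"Low minus high readers, picture-expository: {d:.2f} pooled sd")
```

Under this assumption the difference comes out a little under half a
standard deviation, in line with the figure quoted in the text.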

For the Support subscale, where use of examples and details is
assessed, the findings are the most consistent. With the total reading score
as a blocking factor (Table 13), main effects are found for prompt form

                        Insert Tables 13-16 here

(F = 21.32, df = 1,212, p = .0001), for mode of discourse (F = 11.96,
df = 1,212, p = .001), and a mode x prompt interaction (F = 5.02, df =
1,212, p = .026). In Table 13, we are struck with findings that suggest,
with pictures, low readers perform equivalently to high readers in the
expository mode and superior to high readers in any other treatment. In
addition, it appears that readers make special use of pictures in the
narrative mode. This pattern of findings is repeated with Inferential
Comprehension scores as a blocking factor, and an additional two-way
interaction between mode of discourse and reading level is found. Again,
these findings support the "special" use good readers make of pictures in
the narrative mode (Table 14).

Support

Table 13
Total Reading

                              Picture                   Written
                       Expository    Narrative   Expository    Narrative
High Total      X            3.46         3.13         2.86         2.71
Reading         sd           0.65  (illegible)         0.88         0.61
                n              33           30           35           29
Low Total       X            3.40         2.42         2.50         2.41
Reading         sd    (illegible)         0.57         0.86         0.72
                n              19           24           20           23


Table 14
Inferential Comprehension

                              Picture                   Written
                       Expository    Narrative   Expository    Narrative
High Total      X            3.47         3.12         2.83         2.73
Inferential     sd           0.66         0.91         0.87         0.61
Comprehension   n              28           30           30           33
Low Total       X     (illegible)         2.44         2.60         2.32
Inferential     sd           0.65         0.60         0.89         0.69
Comprehension   n              24           24           25           19



Table 15
Comprehension

                              Picture                   Written
                       Expository    Narrative   Expository    Narrative
High Total      X            3.47         3.12         2.83         2.76
Comprehension   sd           0.58         0.85         0.90         0.61
                n              28           26           29           25
Low Total       X            3.42         2.54         2.62         2.41
Comprehension   sd           0.73         0.76         0.86         0.68
                n              24           28           26           27



Table 16
Vocabulary

                              Picture                   Written
                       Expository    Narrative   Expository    Narrative
High Total      X            3.52         3.11         3.02         2.65
Vocabulary      sd           0.61         0.92         0.93         0.58
                n              23           27           24           26
Low Total       X            3.36         2.52         2.50         2.50
Vocabulary      sd           0.68         0.66         0.79         0.75
                n              29           27           31           26



The Mechanics subscale provides a different sense of the treatment

                        Insert Tables 17-20 here

effects. Under all blocking conditions, a main effect is found for mode
of discourse. Tables 17 to 20 consistently display an association of the
narrative mode with poorer use of mechanics (spelling, syntax, punctuation).
With the Inferential Comprehension blocking factor (Table 18), prompt form
is significant, favoring written prompts. Perhaps the presence of pictures
encourages rapid, sloppy execution of sentence structure.

                        Insert Tables 21-24 here

On the total writing score, composed of both holistic and analytic
scoring procedures, a reading level by mode of discourse interaction effect
is found (F = 4.38, df = 1,212, p = .038), and that finding is evidenced
either significantly or marginally (p = .06) in the other blocking analyses.



Summary

With caveat underscored, the summary of these data is as follows:

1. Sampled eighth grade children's writing ability, whether scored
holistically or analytically, stimulated by picture or written
prompt, and with either a narrative or expository task, is below
minimal levels of competency.

2. Picture prompts generally facilitate writing, particularly for
those subscales which emphasize content detail and coherence.

3. Picture prompts differentially facilitate good and poor readers'



Mechanics

Table 17
Total Reading

                              Picture                   Written
                       Expository    Narrative   Expository    Narrative
High Total      X            3.08         2.63         3.15         2.41
Reading         sd           0.79         0.91         0.92         0.54
                n              33           30           35           29
Low Total       X            2.78         1.92         2.78         1.89
Reading         sd           0.70         0.58         0.94         0.71
                n              19           24           20           23

Table 18
Inferential Comprehension

                              Picture                   Written
                       Expository    Narrative   Expository    Narrative
High Total      X            3.03         2.52         3.21         2.36
Inferential     sd           0.80         0.90         0.98         0.56
Comprehension   n              28           30           30           33
Low Total       X            2.90         2.06         2.79         1.87
Inferential     sd           0.73         0.74         0.84         0.72
Comprehension   n              24           24           25           19




Table 19
Comprehension

                              Picture                   Written
                       Expository    Narrative   Expository    Narrative
High Total      X     (illegible)         2.54         3.24         2.36
Comprehension   sd           0.84         0.86         1.00         0.64
                n              28           26           29           25
Low Total       X            2.93         2.11         2.77         2.02
Comprehension   sd           0.69         0.81         0.80         0.66
                n              24           28           26           27



Table 20
Vocabulary

                              Picture                   Written
                       Expository    Narrative   Expository    Narrative
High Total      X            3.14         2.54         3.36         2.42
Vocabulary      sd           0.80         0.98         0.86         0.56
                n              23           27           24           26
Low Total       X            2.83         2.09         2.75         1.94
Vocabulary      sd           0.72         0.65         0.92         0.68
                n              29           27           31           26



Total Writing Score

Table 21
Total Reading

                              Picture                   Written
                       Expository    Narrative   Expository    Narrative
High Total      X           16.27        17.10        15.76        14.97
Reading         sd           3.62         4.99         5.16         3.08
                n              33           30           35           29
Low Total       X           15.10        12.35        14.08        12.11
Reading         sd           2.51         2.36         5.03         3.73
                n     (illegible)           24           20           26

Table 22
Inferential Comprehension

                              Picture                   Written
                       Expository    Narrative   Expository    Narrative
High Total      X           16.06        16.57        15.89        14.73
Inferential     sd           3.60         5.12         5.14         3.14
Comprehension   n              28           30           30           33
Low Total       X           15.58        13.02        14.27        11.92
Inferential     sd           2.92         3.12         5.08         3.84
Comprehension   n              24           24           25           19

Table 23
Comprehension

                              Picture                   Written
                       Expository    Narrative   Expository    Narrative
High Total      X           15.80        16.86        16.25        14.88
Comprehension   sd           3.63         5.08         5.37         3.56
                n              28           26           29           25
Low Total       X           15.89        13.23        13.92        12.61
Comprehension   sd           2.89         3.47         4.69         3.61
                n              24           28           26           27



Table 24
Vocabulary

                              Picture                   Written
                       Expository    Narrative   Expository    Narrative
High Total      X           16.68        16.59        16.98        14.65
Vocabulary      sd           3.69         5.31         5.67         2.97
                n              23           27           24           26
Low Total       X           15.18        13.39        13.74        12.75
Vocabulary      sd           2.81         3.33         4.24         4.04
                n              29           27           31           26



performance conditioned by mode of discourse. Pictures improve
poor readers' expository writing to the extent that they at least
equal and sometimes outperform good readers in the same treatment
condition, exceed good readers in any other condition, and surpass
other poor readers. The size of these effects ranges from around
1.5 s.d. to .5 s.d.

4. Picture prompts facilitate good readers' performance on narrative
writing tasks.

5. The effect of pictures is negative only for the subscale dealing
with mechanics, e.g., spelling, punctuation, etc.

6. Narrative modes receive poorer scores in general, but are
particularly hard for poor readers.

7. Expository writing, stimulated by written prompts, provides an
adequate opportunity for good readers to demonstrate their
competence.

Implications of These Findings

Of most interest, of course, is the replication of these findings
under conditions which sample a range of narrative and expository tasks in
picture and written prompted situations. The collection of individual
demographic data would also allow for finer grained interpretation.

That the narrative mode fares less well than the expository may be
attributable to long term practice effects. Children may write
more expository than narrative prose, despite contrary claims of curriculum
guides. It is surely the case that raters have more practice, and comfort,
with expository rather than narrative writing. Thus, the practice effects
of raters may be perpetually confounded with those of students.

What is so powerful about picture prompted expository writing for
poorer readers? Pictures appear to provide the necessary content for
students to write about. Perhaps poor writing performance results from
the lack of a content repertoire to write about. Students may be induced
to express ideas if they are presented with a content base. Similarly,
one might question why good readers seem to do well with narrative when
stimulated by pictures. In our studies, the narratives involved "making
up" a story and called for some generative behavior of both content and
form. At one level, this task is much more abstract than that of
expository writing, since a narrative line needs movement, imagination, and
specific content. It may depend more on the general "frame," a knowledge
of standard story structures, and perhaps the experience of hearing
narratives read aloud. Poor readers may not have the skill to "make up a
story," and a single picture may present an insufficient prompt for them.
Its effect may actually be distracting.

It is educators' penchant to sound the alarm for individualized
instruction whenever disordinal interactions occur. But our task is
assessment, rather than exclusively instruction. Instead of matching good or
poor readers to one or another combination of prompt forms and modes of
discourse, our responsibility may be to provide students with alternative
opportunities to demonstrate their writing competency. Apart from
speculation on the function of prompts, our data suggest that single, arbitrary
writing prompts do great disservice, particularly to those students who may
need our special attention.



Perhaps the greatest challenge is to find ways to help students to
retrieve and use the content that they already have to write about but do
not recognize or acknowledge. They may be able to write, if we give them
something to say. The instructional implications of this analysis would
suggest we spend a great deal more time and care in "pre-writing" activities
to assure students have something to communicate.






References

Anderson, R. C. How to construct achievement tests to assess comprehension.
Review of Educational Research, 42(2), 1972, 140-170.

Baker, E. L. Long range plan, 1978-1982. Center for the Study of Evaluation,
University of California, Los Angeles, 1977.

Baker, E. L. The evaluation of the early childhood education program. Los
Angeles, CA: Center for the Study of Evaluation, 1976.

Baker, E. L., & Herman, J. Early childhood education. Los Angeles, CA:
Center for the Study of Evaluation, 1977.

Baker, E. L., Herman, J., & Yen, J. Early childhood education. Los Angeles,
CA: Center for the Study of Evaluation, 1978.

Baker, E. L., & Quellmalz, E. S. Results of pilot studies. Center for the
Study of Evaluation, University of California, Los Angeles, 1980.

Herman, J. The relationship of individualized instruction variables and
second grade reading, mathematics and affective outcomes. Unpublished
doctoral dissertation, University of California, Los Angeles, 1977.

Hively, W. Introduction to domain-referenced testing. Educational Technology,
1974, 14, 5-10.

Levin, J. R., & Lesgold, A. M. On pictures in prose. Educational
Communication and Technology, 1978, 26, 233-243.

Minsky, M. A framework for representing knowledge. In P. H. Winston (Ed.),
The Psychology of Computer Vision. New York: McGraw-Hill, 1975.

Pitts, M. The relationship of classroom instructional characteristics and
writing in the descriptive/narrative mode. Report to the National
Institute of Education. Los Angeles: UCLA Center for the Study of Evaluation,
1978. (Grant No. OB-NIE-G-78-0213)

Quellmalz, E. S., & Baker, E. L. Effects of alternative scoring options
on the classification of entering freshmen writing competencies. Report
to the National Institute of Education. Los Angeles: UCLA Center for the
Study of Evaluation, 1981. (Grant No. OB-NIE-G-80-0112)

Quellmalz, E. S. Interim Report. Defining writing domains: Effects of
discourse and response mode. Center for the Study of Evaluation,
University of California, Los Angeles, 1979.

Quellmalz, E. S., & Capell, F. Defining writing domains: Effects of
discourse and response mode. Report to the National Institute of Education,
November, 1979. (Grant No. OB-NIE-G-78-0213 to the Center for the Study of
Evaluation.)

Spooner-Smith, L. Investigation of writing assessment strategies. Report
to the National Institute of Education, November, 1978. (Grant No.
OB-NIE-G-78-0213 to the Center for the Study of Evaluation.)

Winters, L. The effects of differing response criteria on the assessment
of writing competence. Report to the National Institute of Education,
November, 1978. (Grant No. OB-NIE-G-78-0213 to the Center for the Study
of Evaluation.)



Deliverable - November 1981


CONSTRUCT VALIDITY IN WRITING ASSESSMENT:
PRACTICING WHAT WE PREACH


Annual Report
Edys Quellmalz, Project Director


Grant Number
NIE-G-80-0112
P-3


CENTER FOR THE STUDY OF EVALUATION
Graduate School of Education
University of California - Los Angeles


Construct Validity in Writing Assessment:
Practicing What We Preach
(Effects of Time and Strategy Use on Writing Performance)

Although writing is one of the three basic skills, it has received
much less attention in research, instruction, and assessment than have
the other two subject areas, reading and mathematics. Now, however,
accountability and minimum competency testing mandates have begun to expose
the lack of understanding of and attention to writing. The urgent need among
school practitioners for a reliable, economical system for assessing student
writing ability has led to the measurement of writing through readily
identifiable, countable text features. Accordingly, testing research issues in
writing have focused upon reliability: rating scales, rating procedures, and
task parameters such as selection of topic or mode of discourse. This
narrowed perspective on writing raises a second measurement issue, validity.
To what extent can we feel confident that writing assessment procedures are
measuring nontrivial writing skills? A growing number of researchers in
writing skills voice grave doubts about the construct validity of prevalent
rating scales and testing methods. For the most part, these people cite
erroneous measurement assumptions that emphasize a static and decontextualized
written product; others express concern for the apparent lack of theoretical
basis for many measurement decisions (Emig & Parker, 1976; Gere, 1980; Hirsch,
1977; Odell & Cooper, 1980; Polin, 1980; Smith, 1979).

In contrast to practitioners and test developers, most researchers have
focused upon establishing and validating theory and theory-based models of
writing. The most recent, successful, and widely endorsed efforts propose a
dynamic view of writing as a set of "recursive" processes. In particular,
cognitive information-processing and problem-solving theories applied to
writing have resulted in similar models of the writers' active engagement
with the writing task (Hayes & Flower, 1979; Nold, 1979). A growing amount of
process-based research supports these cognitive models (see, for example,
Bracewell, Bereiter, & Scardamalia, 1980; Bracewell, Scardamalia, & Bereiter,
1980; Matsuhashi & Cooper, 1978; Perl, 1979; Stallard, 1978).

Briefly, the model of writing this study assumed can be characterized
as a cognitive, information-processing model, comprised of two major
interdependent and overlapping processes: composition and transcription.
Composition refers to the invention of the message context, to activities
occurring before writing; transcription refers to the encoding of the
message, the actual production and refinement of the message (Stallard, 1976).
These two large processes subsume many subtasks and skills.

During composing, the writer makes decisions about the audience, writing
purpose and topic. These decisions act as focusing and refining criteria
which influence the recurrent search and selection activities that shape the
message during writing. That is, before actually writing, competent
writers conceptualize their intentions. This framework, in turn, acts as a
plan guiding writing. Such a plan, e.g., the "intended meaning representation"
(Nold, 1979), may affect organization, amount and kind of detail and summary
generalizations, tone, and even syntax in the written product.

The transcription process also can be broken down into subtasks. These
include "recursive" or recurrent planning and revising during writing, massed
revision efforts, and editing, e.g., of mechanics, diction, spelling. These
activities require reading, or rereading, the text during and after writing,
to formulate a sense of what has been produced thus far. The writers then
compare this "text meaning representation" (Nold, 1979) with their original
intentions. Any resulting dissonance suggests appropriate revision strategies
carried out through deletion, substitution, addition, or rearrangement of the
text (Sommers, 1978).

This model presents writing as a complex activity involving many
subskills and processes, each of which draws upon an individual writer's limited
resources (attention, effort) and capacities (memory), as well as stores of
information about the writing topic and the reading audience. Thus, in human
information-processing terms, writing can be viewed as both "a resource-limited
and a data-limited task." The effect of resource demands from the myriad
activities or subskills required for competent writing performance has been
termed "writer overload" (Nold, 1979). However, although writer resources
are limited, they may be stretched or augmented. For instance, the writer
may become more adept at some of the subtasks, and thus free to "pay less
attention" to them. This concurs with descriptions of "skilled writers" at
work (Hayes & Flower, 1978; Matsuhashi & Cooper, 1978; Stallard, 1974). Or,
the writer may employ a "metaplan" describing strategies for efficient
deployment of resources across the required subtasks (Flavell, 1976; Miller,
Galanter & Pribram, 1968). This may describe the implicit goal of instruction
in pre-writing activities that are often explained in such terms (Odell, 1974;
Young, Becker, & Pike, 1970). Another means of stretching writer resources is
to introduce more information into the task, cueing the writers and thereby
reducing the processing requirements for attention. In effect, the writing
task procedures may be manipulated to assume some of the burden of the many
processes required to compose and transcribe the written response, an essay.
In such a case, the resource-limited model of writing suggests writing
performance ought to be facilitated and improved.

This last method, i.e., manipulation of task components, describes the
methodology this study employed to examine the writing process construct and
its implications for test design in writing assessment. This approach has
been termed "facilitative intervention" and used in construction and
validation of cognitive models of behavior and, in particular, in studies of
writing instruction (Bereiter, Scardamalia, & Bracewell, 1979). This study
did not, however, employ an instructional intervention and then test the
sensitivity and fidelity of dependent measures. Instead, the study
intervened in the assessment phase, breaking apart the usual assessment task,
writing an essay on a given topic, into subtasks identified with
predominant instructional theory and with empirically supported theories of
the writing process construct.
Clearly, a lot of mental activity occurs before, during, and after the
writing of an essay. Cognitive theorists include these activities and their
simultaneous, interdependent nature in the description of writing skills.
Given such a rich process domain, testing writing by scoring essay samples
seems a questionable evaluation of writing competence or achievement. While
essay writing surely calls upon the writer to perform all of these skills,
it can adequately measure only the extent to which the student writer is able
to "put it all together" in producing an essay. That is, there is no rating
given for skill at correctly interpreting the topic, audience, or purpose of
a given task. There is no rating describing competence at planning and
revising skills. Nevertheless, research and theory suggest these are the
basics upon which the essay is built. Furthermore, these subskills or
processes are accepted by teachers. The publications of the National Council
of Teachers of English and the focus of instructional methods consortia (the
Bay Area Writing Project and its spin-offs in forty-one states) endorse and
encourage process instruction in writing.

To the extent that the cognitive process models of writing are viable,
current testing is short-changing both the student writers and their teachers
by testing and "scoring" only the criterion performance, integrative essay
production on a given task. There may be enroute or prerequisite skills and
writing processes at which students are, in fact, competent performers and
for which they have received effective instruction. Yet these students'
competencies and growth in writing may be lost because of the "overload"
arising under tests of writing requiring only an essay response to a given
topic.

Method

Subjects and sampling

Tenth grade students (N=320) from two Los Angeles area high schools
participated in the study. The two high schools and the study sample were
racially mixed with respect to Asian, black, Hispanic, and white student
groups. The schools, from different school districts, drew from middle to
low-middle income neighborhoods. Students whose teachers rated them
vulnerable to language interference problems from a non-English primary
language were excluded from the sample (n=18).

Procedures

Within each classroom (n=13 classrooms), students were randomly assigned
to one of six treatment variations. The independent variable in this study
was strategy assistance. Strategy assistance refers to the worksheet
activities given to students to assist them in carrying out processes that
are hypothesized requisites for good writing. These processes have been
described above as planning and revising. In this study there were two
dimensions of planning and revising assistance: (a) broad level, Task Only,
and (b) specific level, Task-Response focused. Task Only worksheets asked
students questions about the content, purpose, and audience of the given
writing task. Task-Response worksheets asked students those questions
interspersed with three additional questions about the same qualities in
their own essay response (either for planning or revising). Thus, the four
treatment groups were: (a) Planners receiving Task Only worksheets,
(b) Planners receiving Task-Response worksheets, (c) Revisers receiving Task
Only worksheets, and (d) Revisers receiving Task-Response worksheets. Two
additional student groups wrote unassisted by worksheets: (e) Unassisted,
Two-Day writers, and (f) Unassisted, One-Day writers. These latter groups
allowed comparisons of the effects of strategy assistance and of simply
"extended time" for writing. Directions to both Unassisted groups did
suggest they use their time wisely to plan, draft, and revise their essay.
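The within-classroom random assignment described above can be sketched in
a few lines of Python. This is an illustration under stated assumptions, not
the study's actual procedure: the group labels are hypothetical (the two
"reviser" labels are inferred from the description of planning and revising
assistance), and dealing shuffled students round-robin so that group sizes
stay nearly equal is an assumption, since the report says only that
assignment was random within classrooms.

```python
import random
from collections import Counter

# Hypothetical labels for the six treatment variations.
TREATMENTS = [
    "planner_task_only",
    "planner_task_response",
    "reviser_task_only",        # inferred label, not named in the source
    "reviser_task_response",    # inferred label, not named in the source
    "unassisted_two_day",
    "unassisted_one_day",
]


def assign_within_classroom(student_ids, rng):
    """Shuffle one classroom's students and deal them across treatments."""
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    return {
        student: TREATMENTS[position % len(TREATMENTS)]
        for position, student in enumerate(shuffled)
    }


def assign_all(classrooms, seed=0):
    """Assign every classroom independently, using one seeded generator."""
    rng = random.Random(seed)
    return {
        room: assign_within_classroom(students, rng)
        for room, students in classrooms.items()
    }


def group_sizes(assignment):
    """Count how many students in one classroom landed in each treatment."""
    return Counter(assignment.values())


if __name__ == "__main__":
    demo = assign_all({"room_1": range(24), "room_2": range(100, 113)}, seed=1)
    for room, assignment in demo.items():
        print(room, dict(group_sizes(assignment)))
```

Randomizing inside each classroom, rather than across the whole sample,
keeps teacher and classroom differences from being confounded with
treatment.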
Students remained in their intact classroom throughout the four
consecutive days of the study. Individual study packets contained the
distinctive task instructions and materials. Study monitors presided over the
classrooms at all times during the study. Regular classroom teachers remained
in the room, but were asked to distance themselves from the proceedings. For
the most part, these teachers sat in the back of the room grading papers or
planning assignments. The students did not have any difficulty understanding
the nature of their daily tasks, the larger four-day context of those tasks,
or the fact that different students were engaged in different tasks. Study
monitors did not identify any significant disruptions or confusions.

The study took place over four consecutive days in Spring. The first
day was used to collect baseline samples of expository writing from all
students. On the second and third days students wrote on one of two different
treatment essay topics, randomly assigned within treatment groups. The essay
task asked for an expository essay written to an audience of peers (for the
school newspaper) about the value(s) of either summer jobs or elective
courses. Students used a blue pen on the first treatment day and a black pen
on the second treatment day. This helped identify the focus of each day's
efforts. Students in all groups were instructed to keep any drafts, notes
or outlines they generated. These continued to be available to them through
their individual study packets. The dictionary and thesaurus could be used
if desired. On the fourth and last day, all students completed questionnaires
on their perceptions of writing under "usual" conditions, and on their
perceptions of instruction received over the semester in their English
composition class.

Dependent Measures

The dependent variable in this study was writing performance. Students'
treatment essays were scored using two measures, each based upon very different
assumptions about the nature of writing. These two measures were: (a) CSE
Expository Writing Scale IV, (b) primary and secondary trait rubrics. The
first scale is an analytic and holistic rating scale developed at the Center
for the Study of Evaluation, UCLA (Quellmalz, 1980). A six category rubric,
it rates students' essays as a whole (in categories of General Impression
and General Competence) and by analyzing specific features (in categories of
Essay Coherence, Paragraph Coherence, Supporting Detail, and Mechanics). In
each of the six categories there is a six point range for describing competence;
scores three and below are considered below mastery or "criterion" for
competence. The major assumption of the CSE Expository Scale IV is the
existence of generalizable features of good writing that can be identified with
the domain of expository essay writing, regardless of topic, audience or
context of the given essay task. The CSE scale and its assumption about writing
reflect predominant essay measures and beliefs about writing assessment found
in school districts and state educational agencies (cf. Spandel and Stiggins,
1980).

The second essay rating measure, primary and secondary trait scoring, is
best described as a method of constructing both tasks and task-specific scoring
rubrics. That is, this assessment system assumes just the opposite of the
CSE and similar measures, i.e., writing performance is highly affected by
context and content of the task. Unlike the CSE Expository Scale, which can
be used without modification for a variety of expository tasks, the primary
trait scale is built from a careful analysis of the features of each unique
writing assignment. To the extent that assignments differ, then too, the
rubric requires alterations. Clearly a more labor-intensive method, the
primary and secondary trait scale is more popularly endorsed by theorists and
researchers than by practitioners. Nevertheless, its specificity has prompted
use by the National Assessment of Educational Progress to track national
writing achievement over time, and by CEMREL Incorporated in their attempt to
develop a writing curriculum (Klaus, Lloyd-Jones, Brown, Littlefair, Mullis,
Miller, & Verity, 1979; Mullis, 1976). For this study, essay topics were
developed from a set of domain specifications, i.e., as a domain-referenced
test might be. The scoring rubrics for primary and secondary traits were
virtually identical for the two topics as they were constructed from the same
set of task criteria. Essentially the primary trait assessed was the use of
related support to link generalizations about "values" to specific features of
"summer jobs" or "elective courses." The secondary trait assessed was the use
of peer-appropriate referents to establish a specific audience, peers. The
primary trait was rated on a zero to four point system; the secondary trait on
a one to four system.

Baseline essays were obtained from all students on the first day of the
study. Instructional history questionnaire data were also obtained and included
in analyses. Teachers were asked to rate their students' writing ability on a
six point scale for which each point was carefully defined. The baseline essays
and teacher ratings were available for use as covariates in final data analyses.



Data Analyses

Three sets of analyses used essay scores derived from the two dependent
measures. Analysis of variance provided answers to the major research
questions about the effect on writing performance due to extended time for
writing, strategy assistance, and timing and specificity of that assistance.
Baseline essay scores and teacher ratings of student writing ability were used as
covariates in analysis of covariance investigations of the influence of entry
skills upon study treatments. Stepwise regression analyses used students'
questionnaire data to determine the interaction of students' skills,
instructional experiences, and perceptions with the study variables.
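As a rough illustration of the first of these analyses, the F ratio of a
one-way analysis of variance can be sketched as below. The scores and group
names are hypothetical and are not taken from the study; the report does not
say which software or exact model specification was used.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    all_scores = [x for g in groups for x in g]
    grand = mean(all_scores)
    k = len(groups)          # number of groups
    n = len(all_scores)      # total number of essays
    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares: scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_b, df_w = k - 1, n - k
    return (ss_between / df_b) / (ss_within / df_w), df_b, df_w

# Hypothetical essay scores on a six point scale for three made-up groups.
unassisted = [3, 4, 3, 2, 4]
planners = [4, 4, 3, 5, 4]
revisers = [3, 3, 4, 3, 2]
f_ratio, df_b, df_w = one_way_anova_f([unassisted, planners, revisers])
```

The F ratio would then be compared against the F distribution with the
returned degrees of freedom to obtain a significance level.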

Rater reliabilities for the two measures were calculated as
generalizability coefficients. For the CSE Expository Scale IV, rater agreements on
students' essays were: General Impression = .77; General Competence = .76;
Essay Coherence = .64; Paragraph Coherence = .63; Supporting Detail = .69;
Mechanics = .75. For the primary and secondary trait scores, reliabilities
were .75 and .70, respectively.
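The report does not give the design or formula behind these coefficients. A
minimal sketch, assuming a single-facet design in which every essay is scored
by every rater and the coefficient is for relative decisions, might look like
this (the function name and the ratings are hypothetical):

```python
from statistics import mean

def g_coefficient(ratings):
    """ratings[p][r] is the score given to essay p by rater r (fully crossed).

    Assumed model: essays x raters, one score per cell. Returns the
    generalizability coefficient for the mean over the raters used.
    """
    n_p, n_r = len(ratings), len(ratings[0])
    grand = mean(x for row in ratings for x in row)
    p_means = [mean(row) for row in ratings]
    r_means = [mean(row[r] for row in ratings) for r in range(n_r)]
    ss_p = n_r * sum((m - grand) ** 2 for m in p_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in r_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_res = ss_total - ss_p - ss_r
    ms_p = ss_p / (n_p - 1)
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))
    # Essay (true score) variance component; negative estimates are set to 0.
    var_p = max((ms_p - ms_res) / n_r, 0.0)
    denom = var_p + ms_res / n_r
    if denom == 0:
        raise ValueError("ratings show no variance at all")
    return var_p / denom

# Hypothetical ratings: three essays, two raters.
example = g_coefficient([[1, 2], [3, 3], [5, 4]])
```

Under these assumptions the coefficient rises toward 1.0 as raters rank the
essays more consistently.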

The philosophical differences between practitioner and research
perspectives on writing, presumed to underlie the two dependent measures, were borne
out in the study data. Correlations between subscales of the CSE Expository
Scale IV and the primary and secondary traits can be considered moderate at
best.

Insert Table 1 about here

The Primary Trait Rubric emphasized building a relationship between
generalizations about "values" (of summer jobs or non-basic skills classes) and the
particular features (of working or a given class) that facilitate or generate
those values. This might be considered an emphasis upon support and
coherency at the essay level. Not surprisingly, those were the CSE subscales with



Table 1

Correlations Among Dependent Measures

                                    Primary Trait Rubrics
CSE Expository Scale IV         Primary Trait   Secondary Trait

General Impression                   .47             -.10
General Competence                   .42          [illegible]
Essay Coherence                      .47             -.12
Paragraph Coherence                  .32             -.06
Support                              .49             -.13
Mechanics                            .28             -.12

Note. With the exception of Paragraph Coherence, correlation
coefficients are based upon a sample of 230 student essays. Paragraph
Coherence figures reflect a sample of 139 student essays in which
paragraphing was attempted. Essays without evidence of paragraphing
by indentation or line skipping between blocks of prose were assumed
to be one-paragraph essays. To score them in Essay Coherence and
Paragraph Coherence would mean to score the same skill twice. They
were scored as "missing data" for Paragraph Coherence. This
practice is reflected in all analyses of this report.



which the Primary Trait scores were most strongly, albeit moderately,
correlated (r = .49 and .47, respectively). However, as might be expected from
the underlying differences in perspectives on writing, Primary Trait and CSE
Scale score correlations were only moderate, though significant (ranging
from r = .32 to .49). The Mechanics subscale was a standout exception.
Primary Trait is intended to "overlook" mechanics and syntax errors in favor
of assessing students' fulfillment of the communicative intent of the task.
Thus, the lower correlation, r = .28, was not unexpected. The Secondary
Trait rubric emphasized audience sensitivity in students' essays. This
sensitivity was defined in terms of tone, wording, and content referent
markers throughout the text. As the CSE Expository Scale IV attended to
audience concerns only as part of its General Impression rating, the lack of
significant correlation between Secondary Trait and CSE subscale scores was
expected (coefficients range from r = -.06 to -.13). It is worth noting that
all correlations between Secondary Trait and CSE subscales were negative, though
low.
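The note to Table 1 says that essays without paragraphing were treated as
missing data for Paragraph Coherence, which is why that row rests on fewer
essays. A minimal sketch of a correlation computed only over essays that have
both scores is given below; the pairwise-deletion rule, the function name, and
the scores are assumptions for illustration, not the report's procedure.

```python
from math import sqrt

def pearson_pairwise(xs, ys):
    """Pearson r over pairs where neither score is missing (None).

    Returns (r, n_pairs_used).
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy), n

# Hypothetical scores; None marks an essay with no paragraphing attempted.
paragraph_coherence = [4, None, 3, 5, None, 2]
primary_trait = [2, 1, 1, 3, 0, 1]
r, n_used = pearson_pairwise(paragraph_coherence, primary_trait)
```

Here only four of the six hypothetical essays contribute to the coefficient,
mirroring the smaller sample reported for Paragraph Coherence.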

Correlations among the subscales of the CSE Expository Scale were quite
high, a phenomenon observed in some previous studies using earlier versions
of the scale (Quellmalz & Capell, 1980).

Insert Table 2 about here

In particular, General Impression, General Competence, Essay Coherence and
Supporting Detail ranged from .87 to .96. Paragraph Coherence, although less
strongly correlated with other subscales, was still quite high with values
ranging from .80 to .92. Mechanics was the most independent of the six
categories, as might be expected. Nevertheless, moderate, significant
coefficients were obtained for correlations with other subscales (ranging



Table 2

Correlations Among Subscales of the CSE Expository Scale IV

                       General      Essay     Paragraph
                      Competence  Coherence   Coherence   Support   Mechanics

General Impression       .92         .93         .77        .96        .65
General Competence       --          .87         .72        .89        .63
Essay Coherence                      --          .78        .90        .58
Paragraph Coherence                              --         .80        .45
Support                                                     --         .53
Mechanics                                                              --

Note. With the exception of Paragraph Coherence, correlation
coefficients are based upon a sample of 223 student essays. Paragraph
Coherence reflects a sample of 139 student essays in which
paragraphing was attempted and could therefore be judged.



from r = .45 to .65). In regression and some post hoc analyses, described
below, the four highly correlated subscales were collapsed (by straight
averaging) into a single CSE Expository Text Level Score; Mechanics remained
intact.
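The straight averaging described above can be sketched as follows. The
dictionary keys and the example scores are hypothetical; the four subscales
averaged are those named in the note to Table 8a (General Impression, General
Competence, Essay Coherence, and Supporting Detail).

```python
# Subscales collapsed into the composite, per the note to Table 8a.
TEXT_LEVEL_SUBSCALES = (
    "general_impression",
    "general_competence",
    "essay_coherence",
    "support",
)

def text_level_score(scores):
    """Unweighted mean of the four highly correlated CSE subscales."""
    return sum(scores[k] for k in TEXT_LEVEL_SUBSCALES) / len(TEXT_LEVEL_SUBSCALES)

# Hypothetical essay: Mechanics and Paragraph Coherence stay separate.
essay = {
    "general_impression": 4,
    "general_competence": 3,
    "essay_coherence": 4,
    "support": 3,
    "paragraph_coherence": 2,
    "mechanics": 5,
}
composite = text_level_score(essay)
```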

The student questionnaire data were selectively used to create seven
instructional variables describing the instructional emphases and practice
opportunities in students' English composition instruction during the semester.
These variables were: practice with extended time available for writing;
instruction and practice on planning; instruction and practice on revising;
instruction on organization and support; instruction and practice on audience
considerations in writing; instruction on grammar and punctuation; and
practice on expository writing. In addition to these instructional variables,
regression analyses used students' self-reported use of revision and planning
strategies, and interaction terms suggested by the correlation matrices for
these variables.

Results

The variables in the study included time available for writing (one
versus two day time period); strategy assistance (worksheets or no assistance);
timing of assistance (as planning or revising); specificity of assistance (three
questions about the assignment versus six questions about the assignment and
the essay response). The study investigated the cognitive,
information-processing model of writing, which describes the writing process construct in terms
of numerous interdependent skills and subtasks involved in generating a
competent essay. Theory postulates that this complex of processes may
overwhelm writers and inhibit their performance of individual skills at which they
may be competent. The study sought to alleviate this "writer overload" effect
by breaking apart the writing task to allow students to focus their efforts
upon planning and revising process requirements. This was expected to
facilitate performance by allowing students' skills, e.g., at planning and
revising, to be fully realized.

Extended Time for Writing

To determine the effects of strategy assistance on writing performance,
it was necessary to be able to exclude the effect of simply having more time
available in which to write. For that reason, the two-day, Unassisted Group
was included. It was also valuable to know whether any increment in
performance in the two-day, Unassisted Group would arise; for this reason, the
study also included the traditional writing assessment setting: one day for
one essay.

The comparison between the two day and one day, Unassisted student writers
indicated that there was no statistical difference in their performance. This
result held true for both dependent measures in both the analysis of variance
and covariance procedures.

The student questionnaire asked students about the significance of time
constraints in a variety of contexts. First, asked about preferences for
writing assessment, 69.8% of the students (n = 20[?]) indicated they would rather
write one essay over two days than write two essays in the standard time frame
of one class period for each (usually about fifty minutes, less "settling down"
time). Second, time constraint on writing performance was raised as a possible
problem student writers faced, among other possibilities such as worrying
about "what the teacher wants," about "good grammar," what to say in the essay,
and so forth. Table 3 presents data on this second question about time
restrictions.

Insert Table 3 about here



Table 3

Students' Self-Reported Writing Problems

                                                  Students Indicating Yes
Writing Problems                                  Number    % of Total(a)

Coming up with ideas to use in the essay            145         49.8
Organizing my ideas                                 123         42.3
Finishing before I run out of time                  122         41.9
Figuring out what the teacher wants                 113         38.8
Getting down on paper the ideas I have
  in my head                                        108         37.1
Writing in "good" English                           106         36.4
Knowing what to do to make the essay better          95         32.6
Going back to check over what I've written           40         13.7
Nothing, I don't have much trouble writing
  essays.                                            33         11.3

Note. Students checked multiple responses; column total exceeds 100%.
(a) The number of students answering the question totalled 291.
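The percentages in Table 3 follow from dividing each count by the 291
respondents, and they can sum past 100% because students could check several
problems. A short sketch that recomputes them from the table's counts (the
variable names are illustrative only):

```python
# Counts as reported in Table 3; 291 students answered the question.
N_RESPONDENTS = 291
counts = {
    "Coming up with ideas to use in the essay": 145,
    "Organizing my ideas": 123,
    "Finishing before I run out of time": 122,
    "Figuring out what the teacher wants": 113,
    "Getting down on paper the ideas I have in my head": 108,
    'Writing in "good" English': 106,
    "Knowing what to do to make the essay better": 95,
    "Going back to check over what I've written": 40,
    "Nothing, I don't have much trouble writing essays.": 33,
}

# Percent of all respondents who checked each item, to one decimal place.
percents = {item: round(100 * n / N_RESPONDENTS, 1) for item, n in counts.items()}

# Total checks exceed the number of respondents (multiple responses allowed).
total_checks = sum(counts.values())
```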



"Finishing before I run out of time" ranked third among the list of
nine potential concerns for student writers. Forty-one percent of the students
(n = 122) indicated time constraint pressure was a problem for them. The
third context in which students were queried about time was in terms of
behaviors and processes going on during actual writing. Table 4 presents results
for this question.

Insert Table 4 about here

Included along with "thinking about how much time is left" were such choices
as rereading, editing, planning, rethinking ideas. In this grouping, time
concerns were much less salient than other concerns, and ranked fifth on a
list of eight items. However, 41% (n = 122) again indicated watching the
clock was something they were conscious of doing while writing.

The student questionnaire also inquired about students' experiences with
time limits on writing. Table 5 presents these data.



Insert Table 5 about here

Here a majority of students, 58.9% (n = 175), indicated that they had had four
or more opportunities during the semester to write their essays over a longer
period of time than one class session or one "overnight" period. Only 16.5%
of the respondents reported never or only once writing under such extended
time conditions.

In short, students report time constraints are a major concern during
writing, as a general writing problem, and in consideration of
writing-assessment preferences. It appears that a majority of students have regular
opportunities to write essays without time constraint pressures.
Nevertheless, despite their expressed preference for and experience with extended time



Table 4

Students' Self-Reported Processes During Writing

                                                  Students Indicating Yes
Processes During Writing                          Number       % of Total(a)

Planning ahead for the next thing to say        [illegible]    [illegible]
Rereading what I've written, even before
  I'm finished                                      190           65.3
Changing my mind about ideas                    [illegible]    [illegible]
Fixing spelling, grammar and/or punctuation
  mistakes                                          154           52.9
Thinking about how much time is left                122           41.9
Trying to keep in mind who's going to read
  the essay                                          56           19.2
Starting over lots of times                          37           12.7
Unsure (or) I just keep writing until
  I'm through                                        28            9.6

Note. Students checked multiple responses; column totals exceed 100%.
(a) The number of students answering the question totalled 291.



Table 5

Students' Reports of Opportunity to Practice

                                      Number of Occasions in the Semester
                                                          Two to    Four or
Practice on Instruction               None      Once      Three      More

Spend more than one class
  period or one night writing
  an essay                            8.4%      8.1%      24.6%      58.9%
Write an essay as if someone
  besides the teacher were
  going to read it                [illegible]  17.4       17.4       28.2
Write essays in other classes
  like history or science            46.1      14.2       18.0       21.7
Turn an essay back into the
  teacher, after you rewrote
  all or part of it                  28.5      17.1       22.1       32.2
Turn in a rewritten paper and
  get it back, graded, a
  second time                        40.4      17.8       18.2       23.6

Note. Total number of respondents is 298. Figures represent the
percent of total respondents indicating a particular frequency for
each item.



for writing, simply providing students with that extra time did not improve
their writing performance in comparison to students writing under the
constrained time condition.

Strategy Assistance Effects

At its broadest level, the study investigated the impact of strategy
assistance as opposed to none, i.e., the extra time only, described just
above. This contrasted the Unassisted two-day writers against the assisted
groups of planners and revisers. Table 6 presents the means and standard
deviations for each group.

Insert Table 6 about here 



The analysis of variance investigations did not yield strong effects for
assisted groups on any of the CSE Expository subscales. However, when
ability covariates were used, allowing greater control over the within group
variation, marginal effects emerged for the Essay Coherence subscale of the
CSE Expository Scale IV (p = .07). The difference in means among strategy
assisted and unassisted groups was largely due to context of that assistance.

The scores for assisted and unassisted groups looked promising for the
Secondary Trait (see Table 6). It seemed that simply providing students with
worksheets that focused some attention on the audience feature of the writing
assignment resulted in improved scores on that trait. This difference did
not yield strong significant results in analysis of variance (p = .11). It
did seem that results might be more favorable when the context variable
governing assistance was examined.

The student questionnaire also asked students whether they considered
themselves planners and revisers when they normally wrote essays. The
possibility that the treatment worksheets interfered or interacted with these
students' planning and revising processes was considered in regression



Table 6

Means, Standard Deviations for Strategy by Specificity

                                        Planners             Revisers
Subscale                Unassisted   3 Q's     6 Q's      3 Q's     6 Q's

CSE Expository Scale IV

General Impression         3.24       3.19      3.22       3.20      3.17
                          (1.03)     (1.14)    (1.05)     (1.26)    (1.10)
General Competence         3.05       3.24      3.31       3.15      3.19
                          (0.95)     (0.99)    (0.87)     (1.08)    (1.06)
Essay Coherence            3.21       3.58      3.54       3.38      3.33
                          (1.00)    (0.[?]9)   (0.91)     (1.12)    (1.13)
Paragraph Coherence(a)     3.30       3.43      3.43       3.43      3.38
                          (0.99)     (0.95)    (0.81)     (1.05)    (0.94)
Support                    3.22       3.29      3.45       3.38      3.40
                          (1.03)     (1.04)    (0.83)     (1.21)    (1.04)
Mechanics                  3.55       3.51      3.59       3.29      3.24
                          (0.99)     (0.99)    (0.88)     (0.91)    (0.97)

Primary Trait Scoring Rubrics

Primary Trait              1.14       1.13      1.05       1.16      1.28
                          (0.69)     (0.63)    (0.75)     (0.80)    (0.73)
Secondary Trait            1.24       1.41      1.39       1.59      1.52
                          (0.72)     (0.66)    (0.70)     (0.87)    (0.64)

Number                      43         55        39         41        40

Note. Standard deviations appear in parentheses.
(a) Paragraph Coherence subscale is scored only where paragraphing
has been attempted. Non-paragraphed essays or single paragraph
essays were considered cases of missing data for the subscale.
Accordingly, the group size differs for that subscale, as noted: Unassisted n = 28;
3 Q's Planners n = 34; 6 Q's Planners n = 30; 3 Q's Revisers n = 25;
6 Q's Revisers n = 24.

analyses entering self-reported strategy as a variable predicting dependent
measure scores on the CSE Scale composite text-level scores (described
earlier), the Mechanics subscale score, the Primary and the Secondary Trait
scores. These results are displayed in Tables 7a through 10b, and are discussed
in the next section on context and specificity of worksheet assistance, with
which they did, in fact, interact.

Insert Tables 7a - 10b about here

Context of Strategy Assistance

The two values for the context of strategy assistance describe the timing
of worksheet assistance in relation to the two-day writing process. Students
who received worksheets at the beginning of the first day were encouraged and
assumed to apply that assistance in planning their essay. Students who
received their worksheets at the beginning of the second day were assumed to
apply that assistance in revising their essay. Students were obliged to
complete their worksheets before moving on to any writing or rewriting activities.
The colored worksheets were easy to spot in the classroom and monitors for the
study were able to ensure that students did complete their worksheets as
scheduled. Students marked down the starting and finishing times for
completing the worksheets. The average time for the six question worksheets was
twenty-five minutes; fifteen for the shorter worksheets. There did not appear
to be a difference in time between Planning and Revising contexts.

Analysis of variance did not reveal any strong differences between
Planning and Revising groups on the CSE Expository Scale. However, the CSE
Mechanics subscale scores were significantly different for the two strategy
context groups (p = .05). Means for the two groups suggested that, for the



Table 7a

Mechanics Scores Regressed on Strategy Treatment,
Instructional Practice and Normal Strategy, and Interactions

Source                                      b         Beta       F ratio

Strategy Treatment, Step 1
  Plan                                     -.08    [illegible]     .22
  Revise                                   -.26       -.13        2.26

Practice and Normal Strategy, Step 2
  Extended Time for Writing                 .17        .25        9.81*
  Practice with Audience, Tone             -.19       -.21        6.75*
  Practice with Expository Mode             .13        .18        5.58**

Interactions, Step 3
  Revise Treatment x Normally
    Revise Only                            -.69       -.15        5.09**

*p ≤ .01
**p ≤ .05



Table 7b

Correlations for Variables in the Regression Equation:
CSE Mechanics Subscale Scores

Variables                               2       3       4

1. Extended Time for Writing           .54     .45     .04
2. Audience Practice                           .41     .18
3. Exposition Practice                                 .01
4. Usually Revise Only x
   Revising Assistance Group

Note. The variables in the table are instructional variables created
from the questionnaire data.



Table 8a

Expository Scale Regressed on Strategy Treatment,
Instructional Practice and Normal Strategy, and Interactions

Source                                      b         Beta         F

Strategy Treatment, Step 1
  Plan                                     1.47        .07         .83
  Revise                                   -.04        .00         .00

Practice and Normal Strategy, Step 2
  Extended Time for Writing                3.35        .47       36.46*
  Practice on Revising Activities         -2.78       -.33       13.20*
  Practice with Audience, Tone            -2.12       -.23        5.93**
  Practice with Expository Mode             .88        .11        2.60

Interactions, Step 3
  Normally Plan and Revise x
    Practice on Organization and
    Support                                1.30        .18        5.40**
  Normally Plan Only x
    Practice on Revising
    Activities                             2.20        .12        3.58

Note. Analytic Score represents average over CSE Expository
Subscales: General Impression, General Competence, Essay Coherence,
Supporting Detail. Refer to Results section for explanation of
conversion.

*p ≤ .01
**p ≤ .05




Table 8b

Correlations for Variables in the Regression Equation:
CSE Expository Scale Score(a)

Variables                            2      3      4      5      6

1. Extended Time                    .52    .54    .45    .35    .03
2. Revising Practice                       .73    .33    .41    .02
3. Audience Practice                              .41    .31    .10
4. Exposition Practice                                   .24    .11
5. Usually Plan & Revise x
   Organization & Support
   Practice                                                    -.25
6. Usually Plan Only x
   Revising Practice

Note. The variables in the table are instructional variables created
from the questionnaire data.

(a) The CSE Expository Scale score is a composite score representing the
average over the subscales of General Impression, General Competence,
Essay Coherence, and Support. This transformation is described in the
text of the Results section.



Table 9a

Primary Trait Score Regressed on Strategy Treatment,
Instructional Practice and Normal Strategy, and Interactions

Source                                      b         Beta         F

Strategy Treatment, Step 1
  Plan                                      .01        .01         .01
  Revise                                    .08        .06         .42

Practice and Normal Strategy, Step 2
  Normally Revise Only                     -.23       -.12        2.35
  Normally Plan and Revise                  .19        .14        2.96
  Extended Time for Writing                 .08        .16        3.34
  Practice on Revising Activities          -.16       -.26       10.84*
  Practice with Expository Mode             .05        .04        1.66

*p ≤ .01



Table 9b

Correlations for Variables in the Regression Equation:
Primary Trait Rubric Score

Variables                              2            3          4      5

1. Usually Revise Only            [illegible]  [illegible]    .23    .04
2. Usually Plan & Revise                       [illegible]    .21    .19
3. Extended Time for Writing                                  .52    .45
4. Revising Practice                                                 .33
5. Exposition Practice

Note. Variables in the table are from the questionnaire responses.



Table 10a

Secondary Trait Scores Regressed on Strategy Treatment,
Instructional Practice and Normal Strategy, and Interactions

Source                                      b         Beta         F

Strategy Treatment, Step 1
  Plan                                      .21        .15         .13
  Revise                                    .32        .22        6.01**

Practice and Normal Strategy, Step 2
  Practice on Planning Activities          -.14       -.21        6.83*
  Practice on Revising Activities           .23        .35       20.83*

Interactions, Step 3
  Revise Treatment x
    Normally Plan Only                     -.35       -.13        3.72



Table 10b

Correlations for Variables in the Regression Equation:
Secondary Trait Rubric Score

Variables                               2       3

1. Planning Practice                   .53     .12
2. Revising Practice                           .45
3. Usually Plan Only x
   Revising Assistance Group

Note. Variables in the table are from the questionnaire responses.



Mechanics score, the Revising groups were degrading their scores, rather than
that the Planning group students were somehow performing better on this
dimension. Note that the Mechanics mean for the Unassisted Group is 3.55, and for
the Planners, 3.54 (see Table 6 for means). Post hoc comparisons using
Scheffe's test for significance supported this hypothesis (p = .04).
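The combined Planners mean of 3.54 quoted above is consistent with weighting
the two Planner group means in Table 6 by their group sizes. A minimal sketch
of that arithmetic, using the Mechanics means and group sizes reported in
Table 6 (the function name is illustrative; the report does not show this
calculation):

```python
def combined_mean(groups):
    """Size-weighted mean of several groups given as (n, mean) pairs."""
    total_n = sum(n for n, _ in groups)
    return sum(n * m for n, m in groups) / total_n

# Mechanics means and group sizes from Table 6.
planners = [(55, 3.51), (39, 3.59)]   # 3 Q's and 6 Q's Planners
revisers = [(41, 3.29), (40, 3.24)]   # 3 Q's and 6 Q's Revisers

planners_mean = round(combined_mean(planners), 2)
revisers_mean = round(combined_mean(revisers), 2)
```

The same weighting applied to the two Reviser groups gives their combined
Mechanics mean for comparison with the Unassisted and Planner values.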

The two assistance groups also differed significantly in their Primary
Trait scores (p = .05), when teacher ratings were entered as a covariate to
account for additional within-group variation. Means for the Planners and
Revisers revealed that the Revisers were outperforming the Planners (see
Table 6). It appeared that this difference was primarily the result of better
scores for the Revising Group with the six-question worksheet assistance. That
is, the effectiveness of assistance was tempered by the interaction of
context and specificity for that assistance.

The Secondary Trait score differences for the Planning and Revising groups
looked more promising than the Primary Trait scores had. However, the comparison of
assisted (Planning and Revising combined) and Unassisted groups had not turned
up statistically significant differences, despite the apparent difference in
means. Differences between Planners and Revisers on this score were also
nonsignificant (p = .13).
Specificity of Strategy Assistance 

Specificity of assistance describes the distinction between the long,
six-question worksheet and the shorter, three-question worksheet used by
Planners and Revisers. The short worksheet asked students to decode the essay
assignment in terms of its audience, content, and purpose. The longer versions
also asked students to either develop a plan in response to those features, or
to interpret their own essay draft in light of these features of the assignment.
Contrasting the three and six question writers (across the planning and
revising context), analysis of variance and covariance did not yield any
differences for this main effect of specificity of assistance. However,
interactions of specificity and context for strategy assistance did yield
some interesting results.

Interaction of Strategy Context and Specificity of Assistance

Means for all four strategy context by specificity groups suggested a
few comparisons might reveal interaction effects for these variables. First,
mean scores on the CSE Support subscale appeared to differ for the two Planning
groups, depending upon which version of the worksheet student writers were
exposed to. The short version, Three-Question Planners, averaged 3.29,
compared to the Unassisted Group which averaged 3.22. The Six-Question Planners,
however, averaged 3.45 on the same subscale. This mean was the highest of all
group means. The Revising groups, both the Three- and Six-Question Revisers,
did not appear to be that much different in their scores. Analysis of variance
did not turn up any interaction effects for the CSE Expository Scales,
including the Support subscale. However, when the covariates were used, the
interaction of specificity and context did attain some significance (p = .05).
Post hoc comparisons using Scheffe revealed the expected marginal difference
between the Three- and Six-Question Planning groups (p = .06).

Although the Primary Trait means looked promising for the effects of
Six-Question worksheets by Planning versus Revising context, this difference did
not test out at a significant level under Scheffe (p = .16).

When the four highly correlated subscales of the CSE Expository Scale
were collapsed into a single "composite" Text-level score for exposition, the
context by specificity interaction effect was marginally significant for the
Planning groups in analysis of variance (p = .06). The Six-Question Planners
outscored the Three-Question Planners.



Regression Analyses with Instructional Variables 

For the composite CSE Expository Text-level Scale score, instructional
practice that allowed extended time for writing a single composition yielded
significantly higher scores (p = .01). Other instructional variables of
significance bore a negative relationship to the Expository Scale Text-level
score. Instruction and practice in revision showed a strong, negative
relationship with essay scores (p = .0[?]). Instruction and practice on audience
concerns in writing also, though less strongly, demonstrated a detrimental
influence on the Expository Scale (p = .05). Students who reported themselves
as both "planners and revisers" and who reported greater instructional
emphases on organization and supporting detail in writing scored more highly
on their essays (p = .05).

On the Mechanics subscale, higher scores were found for students reporting
greater practice in expository writing (p = .01), and essay writing that
extended beyond one class period (p = .05). Interestingly, negative influences
on scores were found for students reporting greater instruction and practice
with audiences besides the teacher/evaluator and audience considerations such
as style and tone (p = .01). Students in the Revising treatment who reported
themselves to be "revisers only" had significantly lower Mechanics scores
(p = .05).

For Primary Trait scores, the only significant variable was the negative
influence from instruction and practice on revision (p = .01). The Secondary
Trait scores obtained the opposite result: higher scores for students
reporting greater instruction and practice on revision (p = .01). Revision
treatment group membership also resulted in higher Secondary Trait scores
(p = .05). However, lower scores resulted for students reporting more
instruction and practice emphasizing planning activities (p = .01).

Discussion

Summary of Results

This study was successful in its attempt to break apart the essay test
task into meaningful subtasks. The domain of writing skills, defined to
include a cognitive process construct, is indeed legitimate to the extent that
the study was able to operationalize some of those processes for students.

Interestingly, treatments interacted with the two philosophically distinct
measures. Planning-assisted students were superior to other students on the
analytic scale categories, except mechanics. Revising students degraded their
scores on the mechanics scale. On the other hand, primary trait scores were
higher for revising students; for the secondary trait, essay audience, all
assisted students outdid unassisted peers.

Regression analyses using questionnaire data confirmed the study premise.
Students who reported themselves as "planners only" were immune to the negative
effects (on the mechanics scale) in prompted revision groups; students who
reported that they were "revisers only" had their revision problems exacerbated
by encouragement to revise. Students reporting themselves as both "planners
and revisers" were more effective regardless of the treatment group in which
they found themselves.

Interpretation of Results 

Strategy assistance treatment made a difference in the subsequent writing
performance of a significant number of students in both the primary/secondary
trait scale and the CSE analytic scale. In General Essay Coherence ratings,
students who first completed planning sheets scored higher than their revising
(and unstructured) peers. In the Support scale, this effect of planning
depended upon the level of specificity of our prompting. Students planning with
the six-question worksheet scored well above other worksheet prompted groups.
On the primary trait subscale, strategy was again a potent effect. However,
here the revising sheet students outscored the planners. Planning effects,
then, seem more salient when assessment methods emphasize text qualities
presumed to exist in all writing, i.e., generalized writing skills. On the
other hand, revising seems more effective in assessments that emphasize
communicative intent, i.e., skill in addressing the purpose and audience of
the writing, although it is unclear why this effect would not have been
mirrored by the General Impression score of the analytic scale.

Planning worksheets had students decode the given task in terms of the
content, the purpose and the given audience. It is, of course, true that the
impact of answering such items may be effective merely because it slows students
down to read the topic carefully. However, the differential effectiveness of
the specificity of planning activities suggests that something more was going
on, at least for the detailed planners. Further interpretation of results
should note differences between the analytic and primary trait scale. Decision
rules for classifying a paper "off topic" differed considerably. Frequently,
otherwise "well written" (in a text sense) essays were included and rated
highly by CSE analytic scale raters, whereas the same essays were judged "off
topic" by primary trait raters. Further, the analytic scale completely
eliminated "off topic" essays, scoring them as cases of "missing data," while
the primary trait scale relegated off topic essays to the lowest category of
competence.
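
To make the consequence of the two decision rules concrete, the sketch below
computes a group mean under each policy. All scores, the off-topic flags, and
the value of the lowest category are invented for illustration, and the sketch
simplifies matters by assuming both scales agree on which essays are off topic,
which the text says was often not the case.

```python
from statistics import mean

LOWEST_CATEGORY = 1  # assumed lowest score point on a hypothetical scale


def mean_excluding_off_topic(essays):
    """Analytic-style rule: off-topic essays are dropped as missing data."""
    return mean(e["score"] for e in essays if not e["off_topic"])


def mean_with_floor_for_off_topic(essays):
    """Primary-trait-style rule: off-topic essays receive the lowest category."""
    return mean(
        LOWEST_CATEGORY if e["off_topic"] else e["score"] for e in essays
    )


# Invented ratings; the third essay is well written but off topic.
essays = [
    {"score": 4, "off_topic": False},
    {"score": 3, "off_topic": False},
    {"score": 4, "off_topic": True},
    {"score": 2, "off_topic": False},
]

analytic_style_mean = mean_excluding_off_topic(essays)
primary_trait_style_mean = mean_with_floor_for_off_topic(essays)
```

Under these assumptions the same set of essays yields a lower group mean when
off-topic papers are scored at the floor than when they are excluded, which is
one way the two scales could diverge for reasons unrelated to treatment.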

Also, the analytic scale assumed that its six categories measured separable
text features, with the General Impression and General Competence scores
functioning as global, composite judgments (see Results section). However, the
correlations between even the four presumably discrete text elements, Essay
Coherence, Paragraph Coherence, Support and Mechanics, were very high, with the
exception of the Mechanics scale. Therefore, when examined across all treatment
groups the subscale intercorrelations suggest that the CSE scale provided two
distinct scores, one at the text-level, one at the sentence level. In effect,
the CSE scale, with its five related scores, might have been able to account
for more variation in raters, i.e., to be more sensitive to effects.

These measurement factors aside, writing theory offers suggestions
regarding findings. The planning students might in fact have been led to
formulate a representation of the task features before beginning to write. This
sense of task, theory proposes, guided writers in drafting their essay response.
Thus we expected such guidance to be reflected in an "essay coherence"
subscale. That is, planners had formulated a task context (parameters) within
which to write. Further, the specific level planners, who were prompted to
plan their own essay, had rehearsed possible content before writing. Our
model suggests that by relieving some of this "thinking while writing" load,
we should have facilitated performance. For the specific planners, improved
writing performance was reflected. These students' additional planning questions
asked them to plan for main idea and supporting details, as well as
audience factors to consider. It is gratifying to have the effects of
structured planning show up as significantly higher scores on the supporting
detail subscale.

Revising effects were confined to the primary trait scale, and the impact
was unaffected by specificity of revising activities. Revising worksheets also
asked students to decode the task (again, in terms of content, purpose, and
audience). It is true that revising effects might simply have resulted from
a "break" from writing and a fresh return to the task after completing
revising worksheets. However, this "break" effect was also available to
planners, most of whom began drafting their essay on the first day, returning
to finish it on the second day. That differences in the specificity of
revision worksheets did not make a difference suggests that the general task
decoding questions were sufficient. Simply the reconsideration of the given
task appears to have provided revising students with some input to improve
their essays.

Why did revising affect only primary trait, and planning only the
analytic scale? Two important assumptions in each of the measures may
account for effects. In primary trait scoring, raters were cautioned
specifically against letting students' grammar and mechanics interfere with
judgments of the primary trait. For both essay topics, the primary trait
scale emphasized the writer's ability to build a relationship between
specific reasons and a general resolution. Secondly, this scale's highly task
sensitive categories resulted in a very clear distinction between "off
topic" and "eligible" essays. Under the CSE analytic scale rubric, few essays
were deemed "off topic."

Planning, however, did not include a "check" on the validity of the
student planners' interpretation of the essay task. That is, planners might
initially go "off topic" and, without a critical reappraisal of the match
between task and essay response, never realign their essay. Unprompted students
tend to constrain their revision efforts to word and sentence level
modification. Planners, then, judged under a scale emphasizing task and essay
match, and de-emphasizing text features, might lose their advantage against a
group of revising students directed to larger, task-oriented reconsiderations.

Revisers began writing without planning assistance. Obviously, "off
topic" responses were also likely to be generated (perhaps even more likely
so). However, revisers were halted somewhere between drafting and turning
in the essay to be rated. In this writing hiatus, students were asked to
go back to the original task or assignment and decode it in terms of its
content, purpose and audience. After this "time out" to reconsider the
given task, revisers returned to their essay responses, the representation
of the task components fresh in their mind. Further, these students had
been cued to revise their essays in light of their worksheet responses. That
is, they had been prompted to view revision in a broader context. Thus for
revisers, text level features such as the use of transitions (valued in the
essay coherence subscale) were less salient than the alignment of task and
response. Accordingly, it is not surprising that primary trait scores were
improved by revising activities. The absence of this effect from the analytic
scale seems a bit difficult to explain, except in terms of the comparative
strength of planning effects. It may be less that revising is ineffective
for the analytic scales than that the planning effect is simply greater. In
fact, if we look at strategy group means, revisers do outperform the
comparison group (two-day, unstructured) but nevertheless trail planning group
effects.

In and of itself, students' usual writing strategies (self-reported)
affected scores on the primary trait scale when students were assisted by
planning prompts, and particularly when this assistance was more detailed.
Students who reported naturally employing planning strategies were more
successful with their essays. The earlier reported effects were stronger.
Additionally, previously unaffected subscales of mechanics (CSE analytic scale)
and audience on the secondary trait subscale reflected the impact of strategy
treatments when ability (as usual strategy use) was entered into the model.

Thus it appears that students may indeed be able to use strategies,
planning at least, yet not be able to bring their strategy skills to bear
upon their essay. This is less the case for revision; however, instructional
history, as reported by students, suggests students receive little practice
or feedback on their revision efforts. While teachers correct and hand back
papers with useful commentary, these remarks reflect upon the planning and
writing more than the revision. They provide information about writing
efforts so far, i.e., pre-revision, and supply information to guide next
efforts on a new (next) assignment.

Further, post hoc revision is much more difficult to prompt. Students'
planning efforts are always put into action, even subconsciously, once the
writing begins. However, simply encouraging revision, even having students
reread and critique essay and task (as in our revision worksheets), does not
ensure that students will use their revision information and ideas, nor know
how to do so. In short, treatment in revision was much less controlled by
the structured context than by student cooperation and effort.

Implications for Test Design

This study succeeded in helping students divide the essay task into
subtasks they could handle. This expanded domain of writing skills included
planning and revising processes and, without supplying answers, asked students
to focus on the main features of the essay assignment. In particular, success
was greatest for planning processes. Using a worksheet with items requiring a
written response, students in planning groups decoded the test task into
features of audience (peers), content (topic 1 or topic 2). Presumably this
supplied writers with a "representation" (albeit crude) of communicative
intent and task parameters. This alone made a difference in writing as the
planning students drafted their essay responses. A planning subgroup was led
further into planning processes by answering worksheet items about the actual
main idea and support, and features of the given peer audience that would
formulate the essay response. In short, these planners made "plans to do" and
"plans to say." Carrying out these processes by direction and over an
extended period of time (not at the expense of essay writing time), prompted
writers seemed better able to cope with writing demands and produce better
essays.

Results from this study bear upon test design in writing. It appears
that there are, in fact, subskills involved in writing that affect the
quality of writing whether that quality is defined in terms of text features
or communicative function. It appears that these subskills or processes can
be broken out from the essay writing task. It also appears that many students
who claim to perform these subskills are unable to do so effectively in their
essay writing. This suggests that if we only measure writing in terms of the
complex, integrative task of generating a complete essay we are missing student
competence at lower levels of skill (developmentally or hierarchically).

On the other hand, if we expand assessment of writing to sample the full
domain of writing skills we may provide more instructionally useful and
sensitive information about student competence. Perhaps we should explore the
possible methodologies for assessing "enroute" skills such as planning and
revising (beyond simple word and sentence level errors), even determining
communicative purpose of writing.

A second important implication of this study is the measurement focus
issue. The differential emphases of the analytic and primary trait scale
were reflected to some extent in the difference between strategy foci. Where
planning led to greater cohesion and support, revising led to greater success
at fulfilling the task purpose and attending to audience considerations,
although it is unclear why the primary trait's coordinate emphases on
supporting generalizations with specific detail did not correlate more strongly
with the analytic scale ratings of coherence and support. Further research
efforts might attempt to disentangle concerns for audience and task sensitivity
by comparing separate ratings of these features with the composite primary
trait rating.

In sum, we believe this study has provided rationale, empirical support
and some avenues to explore in the development of a broader, more valid
assessment approach to the writing skills area.



Reference Notes

1. Bracewell, R., Bereiter, C., & Scardamalia, M. How beginning writers
succeed and fail in making written arguments more convincing. Paper
presented at the annual meeting of the American Educational Research
Association, Boston, 1980.

2. Bracewell, R., Scardamalia, M., & Bereiter, C. An applied cognitive-
developmental approach to writing. Paper presented at the annual
meeting of the American Educational Research Association, San Francisco,
1979.

3. Bracewell, R., Scardamalia, M., & Bereiter, C. A test of two myths about
revision. Paper presented at the annual meeting of the American
Educational Research Association, Boston, 1980.

4. Emig, J., & Parker, R. Responding to student writing: Building a theory
of the evaluating process. Rutgers University, 1977.

5. Hayes, J. R., & Flower, L. Protocol analysis of the writing process.
Paper presented at the annual meeting of the American Educational Research
Association, Toronto, 1978.

6. Hayes, J. R., & Flower, L. Writing as problem solving. Paper presented
at the annual meeting of the American Educational Research Association,
San Francisco, 1979.

7. Matsuhashi, A., & Cooper, C. A video time-monitored observational study:
the transcribing behavior and composing process of a competent high school
writer. State University of New York, Buffalo, 1978.

8. Mullis, I. The primary trait system for scoring writing tasks. Paper
presented at the annual meeting of the American Educational Research
Association, San Francisco, 1979 (ERIC Document Reproduction Service
No. ED 124 942).







9. Nold, E. The Writing Process. Unpublished manuscript, Stanford
University, 1979.

10. Polin, L. G. Alternative conceptions of the writing skills domain:
Problems for the practitioners. Paper presented at the annual meeting
of the National Council on Measurement in Education, Boston, 1980.

11. Smith, L. S. The Effects of Differing Response Criteria on Assessment
of Writing Competence. Unpublished doctoral dissertation, University
of California, Los Angeles, 1979.

12. Sommers, N. Revision strategies of student writers and experienced
writers. Paper presented at the annual meeting of the National Council
of Teachers of English, San Francisco, 1978.




Reference List 

Flavell, J. H. Metacognitive aspects of problem solving. In L. Resnick
(Ed.), The Nature of Intelligence. Hillsdale, New Jersey: Lawrence
Erlbaum Associates, 1976.

Gere, A. Written composition: Toward a theory of evaluation. College
English, 1980, 42, 44-58.

Hirsch, E. D. The Philosophy of Composition. Chicago: University of Chicago
Press, 1977.

Klaus, C. H., Lloyd-Jones, R., Brown, R., Littlefair, W., Mullis, I.,
Miller, D., & Verity, D. Composing Childhood Experience: An Approach to
Writing and Learning in the Elementary Grades. St. Louis: CEMREL,
Incorporated, 1979.

Miller, G. A., Galanter, E., & Pribram, K. H. The formulation of plans.
In P. C. Wason & P. N. Johnson-Laird (Eds.), Thinking and Reasoning.
Baltimore: Penguin Books, Incorporated, 1968.

Odell, L. Measuring the effect of instruction in pre-writing. Research in
the Teaching of English, 1974, 8, 220-240.

Odell, L., & Cooper, C. Procedures for evaluating writing: Assumptions and
needed research. College English, 1980, 42, 35-44.

Perl, S. The composing process of unskilled college writers. Research in
the Teaching of English, 1979, 13, 317-336.

Quellmalz, E. Final Report: Controlling Rater Drift. Los Angeles: Center
for the Study of Evaluation, UCLA, 1980.

Quellmalz, E., & Capell, F. Final Report: Defining Writing Domains: Effects
of Discourse and Response Mode. Los Angeles: Center for the Study of
Evaluation, UCLA, 1979.

Spandel, V., & Stiggins, R. J. Direct Measures of Writing Skill. Portland,
Oregon: Northwest Regional Educational Laboratory, Clearinghouse for
Applied Performance Testing, 1980.

Stallard, C. K. An analysis of the writing behavior of good student writers.
Research in the Teaching of English, 1974, 8, 206-218.

Stallard, C. K. Composing: A cognitive process theory of writing. College
Composition and Communication, 1976, 27, 181-184.

Young, R., Becker, A., & Pike, K. Rhetoric: Discovery and Change. New York:
Harcourt, Brace & World, Incorporated, 1970.






DESIGNING WRITING ASSESSMENTS: BALANCING
FAIRNESS, UTILITY, AND COST



Edys S. Quellmalz

Center for the Study of Evaluation 
UCLA Graduate School of Education 
Los Angeles, CA 90024 



Grant No. OB-NIE-G-78-0213



The research reported herein was supported in whole or
in part by a grant to the Center for the Study of
Evaluation from the National Institute of Education, U.S.
Department of Education. However, the opinions and
findings expressed here do not necessarily reflect the position
or policy of NIE and no official NIE endorsement should
be inferred.



Designing Writing Assessments: Balancing
Fairness, Utility, and Cost

To attain the fundamental goal of language competence, educators,
students, and parents must have information describing the status and
progress of language skills development. Mounting concern for student
achievement in writing, one of the principal arenas of language
development, has refocused the attention of policy makers, evaluators,
instructors, and researchers on the features of writing assessments
necessary to represent a student's writing skill fairly, usefully, and
economically. While the relationship between procedures employed to
evaluate writing in large scale testing and those used in the classroom
has historically been tenuous, the requirements of minimum competency
testing programs have stimulated research on methods to tighten the
connection. These competency testing programs require school systems to
assess the status of basic skill achievement, and then either to certify
that minimal competencies have been attained or signal the need for
remediation and provide repeated opportunities to pass comparable test forms.
If these writing competency tests are to fulfill their intended function,
then the writing assignments and evaluative criteria of large scale tests
and classroom instruction must interrelate.

At present, many large scale writing tests bear little resemblance
to students' classroom writing experiences. Many states and districts
rely on multiple choice tests that measure sentence-level editing skills
or passage comprehension. When writing samples are collected, the structure
and topic of the writing assignment may call for information and strategies
that vary considerably from students' experiences in and out of the
classroom. Furthermore, writing samples are often scored rapidly and
holistically by raters trained to varying levels of precision and accuracy.
Students receive a single score purportedly representing the level of their
writing competence.

Reactions of practitioners and researchers to such current practices
are increasingly critical. They find many faults in current writing tests:
their logical and psychological relevance to realistic writing situations,
their utility for informing decisions about individual competence or
program effectiveness, their fairness to students and instruction, their
legality for sanctioning exit requirements. This paper suggests that state
and district writing assessments should re-evaluate their current methods
for assessing student writing competence in light of these criticisms. An
accumulating body of literature indicts many of the methods assessments
now use that have been derived from custom, folklore, and adaptations of
norm-referenced testing methodology that are inappropriate for the purposes
of competency assessment. By examining the criticisms leveled at writing
tests and considering alternatives proposed by recent writing theory and
research, we may find solutions that will improve the fairness and utility
of writing assessments, yet remain within reasonable economic bounds.

Problem 1: Specifying Writing Goals

Just what is "good" writing? For schools, a major conflict has been
to distinguish between realistic characteristics of minimum competence,
reasonable high school writing exit competence and the competence of
professional writers and "experts." A significant component in this
controversy over "standards" has been the function various types of writing
can and/or should have for the student. Thus the discourse aim or writing
purpose of transactional writing has been identified by many school systems
as functionally most relevant to the majority of students. At the lower
grades, expressive writing has been viewed by some as valuable in its
own right and by others as an educational vehicle for motivating writing
that will increase fluency and sentence-level competence.

Clearly, the schools' definition of the target constrains the specific
criteria that will provide logical and empirical evidence that the target
has been hit. Currently, goals may relate to two competency levels, a
minimum competency level targeted by most state and district minimum
competency testing programs and a reasonably desirable high school exit
competency level implied in many systems' curricular goals. Most competency
programs emphasize transactional writing in the factual narrative,
expository or persuasive modes. Minimum program goals are often that students
write a clear, coherent paragraph that makes a point and that exhibits
few or no mechanical, sentence-level errors. For high school exit goals,
English departments set their sights at the multi-paragraph, essay level,
seeking writing that has a theme or point, that is coherent between, as
well as within paragraphs, and that exhibits few sentence-level errors.
While minimum goals generally specify functional writing, high school
desirable exit goals may expand the types of writing aims or purposes
in which they would like students to be competent. By distinguishing
between minimum and desirable goals, school systems may be in a better
position to defend the logic, utility and fairness of focused test
procedures.

Problem 2: Designing Appropriate Writing Tasks 

Perhaps the most common controversy in the design of writing tests
swirls about the relative merits of direct and indirect tasks. Indirect,
usually multiple choice, measures have been defended by test publishers
because of their economy and high correlations with essay scores (Godshalk,
Swineford & Coffman, 1966; Breland & Braucher, 1977). Critics of multiple
choice tests reject them logically and psychologically. They argue that
multiple choice tests present primarily editing tasks or comprehension
tasks and that they therefore do not tap the same kinds of mental processes
required by production tasks (Bourne, 1966; Quellmalz, 1978; Cooper, 1979).
Recent empirical studies of students' scores on direct and indirect
measures indicate considerably lower correlations between writing skill
component scores derived from multiple choice and writing samples (Quellmalz &
Capell, 1979; Quellmalz, Smith, Winters & Baker, 1980; Moss, Cole &
Khampaliket, in press). Furthermore, Quellmalz and Capell (1979) found multiple
choice test scores provided less distinctive information about underlying
writing skill constructs or traits than did essay ratings (Quellmalz &
Capell, 1979). In combination, these studies support contentions that
direct and indirect measures tap different psychological processes. These
data would also, of course, suggest that multiple choice test scores would
not serve as fair or useful proxies for actual writing skill. At best,
multiple choice tests seem to overestimate skills (NAEP, 1981) since they
measure skills presumably enroute to production skills (Skinner, 1957).
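
The correlational evidence discussed above rests on a statistic that can be
sketched briefly. The following is a minimal illustration, not the procedure
of any cited study: it computes a Pearson correlation between a set of
multiple choice scores and a set of essay ratings for the same students. The
function name and all score values are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)


# Invented scores for five students on an indirect and a direct measure.
multiple_choice_scores = [10, 12, 15, 18, 20]
essay_ratings = [2, 3, 2, 4, 3]

r = pearson_r(multiple_choice_scores, essay_ratings)
```

A coefficient well below 1.0, as in this invented example, is the kind of
result that would argue against treating the indirect measure as a proxy for
the direct one.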

In addition to the form of response required by writing tests, there
is considerable disagreement about the appropriate structure of assignments
used to prompt writing. Criticisms of writing tasks are that they do not
present full rhetorical contexts that sufficiently inform students about
the writing purpose, topic, audience, writers' role and intended criteria
(Britton, 1978; Cazden, 1974; Scribner & Cole, 1978; Florio, 1979).
Research shows that writers perform differentially well when writing in
different discourse modes, e.g., exposition and narration (Veal & Tillman,
1971; Crowhurst, 1980; Quellmalz & Capell, 1979; Praeter & Padia, 1980;
Baker & Quellmalz, 1980). Research also reveals that accessibility of
information about an assigned topic affects the quality of students' writing
(Baker & Quellmalz, 1980). Polin has found that when given extended time
and cues about the rhetorical demands of the task during planning or
revision, some writers improve some features of their work. In sum, studies
of features of the writing task that influence students' writing
performance suggest that variations within features such as mode of discourse
(writing aim), topic, audience, time and structural cues do present
different psychological demands and therefore should be distinctly specified.
To be clear and fair, the writing task should provide a full rhetorical
context and time to engage in all parts of the writing process. The cost
of developing well formed writing prompts is not high, particularly in
comparison to the cost of erroneous inferences about competence made from
assessments of writing students generated in response to incomplete or
ambiguous prompts.






Problem 3: Specifying Scoring Criteria and Type of Rating Scale 

Criteria employed for evaluating student writing vary along a number 
of dimensions: from qualitative to quantitative; from general to specific; 
from comprehensive, full discourse features to isolated features; from 
vague guidelines to replicable, objective guidelines. 

At the most qualitative, vague end of the continua are general
impression scoring schemes where readers apply their own criteria to give
the writing a single global score. Follman and Anderson's "Everyman"
procedures (1967) and teachers' A-F general schemes fall in this category.
Still providing a single score or quality rating, but guided by slightly
more descriptive and acknowledged criteria, are holistic rating schemes
such as the ETS four or six-point scales ranking papers within a set.
Teachers' use of a letter grade with some supporting comments might relate
to this evaluation scheme. Some rating schemes are specific to discourse
mode; others, like the primary trait rating method, are specific to
discourse mode and the particular topic (Lloyd-Jones, 1977). The most
detailed scales are analytic rating schemes referencing component features
of the written product.

Where do these criteria come frtim? Criteria for these scales may be 
inferred from^ features commonly referenced by knowledgeable readers, may 
'be arbitrary, or may be theoretically- or , empirically-based dimensions 
deemed important by the group designing the scheme. Analytic scales vary 
in the degree to which they comprehensively reference rhetorical, structural 
syntactic features, as well as the degree to which criteria for features 
are qualitative, more objective, or, >even, quantitative. In m 2 attempt 

& 



235 



to be comprehensive, the subscales of the Diederich Expository Scale range 
from "ideas" to spelling (Diederich, 1974). In contrast, analytic text 
analysis schemes such as T-unit analyses or Halliday and Hasan's measures 
of cohesion focus on isolated components of the written piece 
(Halliday & Hasan, 1976). Diederich's "flavor" subscale is far more quali- 
tative and judgmental than counts of numbers and types of cohesive ties. 
In classroom evaluations of student writing, grades and teachers' comments,
too, may reference a range of essay features such as content, organization,
and mechanics (Freedman, 1979); or their comments may relate only to
sentence-level problems.

One issue in developing or using a rating scheme is the meaning of
writing score(s). From a psychological perspective, does being a "2"
vs. a "4" discriminate between levels of a student's writing competence?
At present, there is little research evidence that any sets of criteria
used in actual practice are more valid than others for discriminating
between levels of expertise. From a logical perspective, how specific,
replicable, and informative are rating criteria? Pedagogically, what
implications do the scores have for diagnosing strengths and weaknesses?
The bases of the score, the criteria, should serve as feedback to teachers,
students, and parents. To be fair, criteria employed in minimum competency
tests should specify writing elements that are basic writing skills, e.g.,
organization, support, mechanics. The criteria should also be those
amenable to instructional intervention. The more judgmental, qualitative,
sophisticated, and less teachable writing elements such as flavor, style, or
voice would seem less fair and useful, and would therefore be
inappropriate as rating criteria for judging basic writing competence.
Specification of criteria may be the most important decision affecting the
utility of information provided by assessment, both large scale and
classroom level. Certainly, consensual decisions on these criteria should
involve instructional and evaluation personnel.

It seems logical that criteria used in large scale writing competency
assessment should reflect, if not derive from, criteria used to evaluate
student classroom writing. An ideally integrated instructional system
that targets particular writing elements as important basic competencies
would involve teachers and evaluators in specification of rating criteria
and encourage focused classroom guidance, feedback, and evaluation on
these elements. Instructionally, specification of valued basic criteria
could provide a more comprehensive framework for teachers to focus
instruction and communicate feedback to students about their writing. The
scanty research on classroom evaluation methods suggests that teacher
comments more often cite easily identified sentence-level mechanical errors
than give text-level feedback on organization and support (Pitts, 1978;
Quellmalz, Baker, & Enright, 1980). As Coffman pointed out, while few would
recommend complete restriction and regulation of the criteria teachers use
in classroom writing assessment, neither would they condone subjecting
students and the instructional program to wildly fluctuating, idiosyncratic
standards of individual teachers (Coffman, 1971). Some standardization
of writing criteria seems particularly critical for minimum competency
goals. And, of course, schools using criteria for system-wide assessment
that are also used in classrooms would eventually reduce the cost of
training raters considerably.



Assuming that criteria have been specified that are logical, fair,
and useful, the format for recording scores remains a problem. Many large
scale assessments report a single, holistic score. A logical question
is whether it makes sense to comment on component features of a student's
writing instead of, or in addition to, its overall quality. A likely
question to be raised about a single global score by a teacher, student,
parent (or lawyer) is "Why?" followed by "Show me." While writing theory
may suggest that the "whole" is greater than the "sum" of its parts, research
in psychology and pedagogy suggests that learners advance when taught how
to use components and combine them into competent performance (e.g.,
Skinner, 1957; Resnick, 1980). Another logical question is whether students
are differently classified as masters and non-masters and/or if analytic
schemes yield a differential score profile. Winters (1978) found that
various scoring rubrics, including a general impression scale, two analytic
scales, and a T-unit analysis, did classify students differently. Quellmalz,
Smith, Winters & Baker (1980) found that three separate holistic rubrics
and an analytic rubric classified entering freshmen differently. Similarly,
Polin (1980) found very low correlations between primary trait and analytic
ratings of the same essays. Each of these studies compared scoring rubrics
which referenced some similar criteria but which, in application, produced
variable characterizations of the same essays. Still unexamined are the
cost benefits of scales using the exact same criteria but recording a
single, holistic judgment or several separate analytic scores. Such a
study is currently in progress (Quellmalz, 1981).

A major problem for large scale writing assessments, to be sure, is
the cost of providing more detailed ratings. In the narrowest sense, cost
is measured in terms of time required to train raters and time required
to rate papers. Generally, training on more criteria that are more
explicit requires more time than training on fewer or less explicit criteria.

Currently available data on scoring costs indicate that training time
for holistic and primary trait scoring averages two to four hours (Powills,
Bowers, & Conlan, 1979; Mullis, 1980) and for analytic scoring averages
six to eight hours (Smith, 1978; Quellmalz & Capell, 1979). Trained raters
can assign a holistic or primary trait score to a student's paper reliably
in 30 seconds to 1½ minutes (Powills et al., 1979; Mullis, 1980). Rating
time for providing five to eight separate analytic scores ranges from four
to five minutes for multi-paragraph essays and from two to four minutes
for paragraphs (Smith, 1978; Quellmalz & Capell, 1979).
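
The training and rating figures cited above can be combined into a rough
estimate of total scoring time. The following is a minimal sketch only, not a
procedure reported in the studies cited. The function name, the number of
raters, the number of papers, and the assumption of two readings per paper
are all hypothetical; training hours are taken as midpoints of the cited
ranges, and rating times as the 30-second lower bound for holistic scoring
and the midpoint of the four-to-five-minute range for analytic scoring of
multi-paragraph essays.

```python
def scoring_hours(n_raters, n_papers, training_hours, minutes_per_paper,
                  readings_per_paper=2):
    """Total person-hours: rater training plus time spent rating papers."""
    training = n_raters * training_hours
    rating = n_papers * readings_per_paper * minutes_per_paper / 60.0
    return training + rating


# Hypothetical district: 10 raters, 1,000 essays, each essay read twice.
# Training hours are midpoints of the cited ranges (2-4 h and 6-8 h).
# Rating times: 0.5 min (holistic lower bound) and 4.5 min (analytic midpoint).
holistic = scoring_hours(10, 1000, training_hours=3.0, minutes_per_paper=0.5)
analytic = scoring_hours(10, 1000, training_hours=7.0, minutes_per_paper=4.5)

print(f"holistic: {holistic:.1f} person-hours")
print(f"analytic: {analytic:.1f} person-hours")
```

Under these assumed figures most of the difference comes from rating time
rather than training time, so the comparison is sensitive to the number of
papers and readings a system actually plans for.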

In a recent study comparing two score formats, an analytic scheme and
a holistic scheme modified to provide diagnostic checks for students rated
below mastery, Quellmalz found that average rating times per paper differed
by approximately one minute (Quellmalz, 1981). Is the additional training
and rating time "worth it"? School systems weighing this question might
consider broader definitions and implications of cost. First, the cost
of either analytic or holistic training could be jointly shared, as an
inservice activity, by curriculum budgets. These training costs would also
then decrease to review time when all teachers in a system were trained.
A second potential cost sharing strategy is to view essay ratings as
diagnostic components of the instructional system to both focus and monitor
program improvement. A third cost concern is an ethical one. Students
have spent considerable time producing writing samples, and the psychological
and opportunity costs to them of uninformative or erroneous classification
as failures can be profound. Finally, a system might consider the degree
of specific support useful for defending mastery/non-mastery
classifications; the costs of remediation and lawsuits because of
misclassifications can be high.

Problem 4: Technical Quality of Rating Criteria

A fundamental responsibility of an assessment program is the documentation
of its technical quality. For writing assessments this becomes a problem
of scale stability and validity, i.e., demonstrating that score criteria
are applied uniformly within and between rating occasions and that other
measures of student writing competence corroborate the test ratings
(Quellmalz, 1980).

When carefully structured scale training sessions precede actual rating,
most holistic and analytic rating scales can demonstrate high interrater
reliability (Powills et al., 1979; Mullis, 1980; Quellmalz, 1980; Steele,
1979; Van Nostrand, 1980). But interrater agreement within a rating session
is not sufficient for demonstrating scale reliability. Analogous to the
problem of test-retest reliability, a reliable scale must be stable, i.e.,
demonstrate that its criteria would be applied consistently by new sets
of raters to both a new set of papers and to the set of papers scored by
the first raters. To the extent that criteria are differently applied, the
scale is not stable and reliable (Quellmalz, 1980).
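
One simple way to check stability of this kind is to have a second set of
raters score a common set of papers and compare the two sets of scores. The
sketch below is illustrative only and is not the procedure used in the
studies cited. The function name and the scores are hypothetical, and exact
and adjacent (within one point) agreement rates are assumed here as the
summary statistics; a correlation or other index could be substituted.

```python
def agreement_rate(first_scores, second_scores, tolerance=0):
    """Fraction of papers whose two scores differ by at most `tolerance`."""
    if len(first_scores) != len(second_scores) or not first_scores:
        raise ValueError("score lists must be non-empty and equal in length")
    hits = sum(1 for a, b in zip(first_scores, second_scores)
               if abs(a - b) <= tolerance)
    return hits / len(first_scores)


# Hypothetical scores on a 1-4 scale for the same six papers, assigned by
# two different rater groups at two different rating sessions.
session_one = [4, 3, 2, 4, 1, 3]
session_two = [4, 3, 3, 4, 1, 2]

exact = agreement_rate(session_one, session_two)
adjacent = agreement_rate(session_one, session_two, tolerance=1)
print(f"exact agreement: {exact:.2f}, within one point: {adjacent:.2f}")
```

Reporting such rates for common paper sets across sessions, alongside
within-session interrater reliability, would document the stability the
text calls for.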

Few scales currently used in writing assessment report data about
their stability across sets of raters and rating occasions. It seems
that scales with more explicit and operational criteria are less
susceptible to fluctuating qualitative judgments and are more likely to be
stable across paper sets and raters. Holistic scales such as the ETS 
method which awards scores according to a paper's ranking within a unique 
set of papers result in a sliding scale (Conlan, 1579). A "2" paper in 
one paper set may well have characteristics quite different from a "2"
paper in a set of papers with a broader or narrower quality range. While
some attempt is made to stabilize judgments across sets of raters by
inserting anchor papers during training, anchor papers are less frequently
interspersed in actual rating sequences. Statistical evidence of the
comparability of scores given on any such anchor papers by different groups
of raters is noticeably, and seriously, absent. Thus, holistic scales
using ranking procedures within sets and unexplicated criteria are suitable
for norm-referenced selection decisions, but cannot meet competency test
requirements for stable, uniform application of criteria. On the other
hand, holistic scales based on more descriptive criteria such as the
primary trait method (Lloyd-Jones, 1977) may be more likely to permit stable
application across paper and rater sets. Reports for most analytic scales
also document interrater reliability within rating occasion but do not
track stability across occasions. For analytic as well as holistic scales,
precision of criteria is a critical factor in achieving scale stability.
School systems designing writing assessments should routinely report
interrater reliability and check scale stability on common paper sets scored
at different rating sessions. These measures will reassure stakeholders 
that assessments are uniform and fair. 

The task of documenting the validity of writing assessment rating
scales can take several forms. Most competency-based writing assessments
attempt to establish content validity through expert judgments about the
skills assessed (Breland & Ragosa, 1976). Few writing assessment programs
subject the rating scales used to evaluate those skills to content
validity scrutiny as well. Since, for written production, the scale defines
what acceptable writing is, the content validity of scales should be judged
by the same procedures as test items or specifications. It may be that
some scales with vague criteria or criteria heavily weighted toward
sentence-level mechanics would not get the stamp of approval from a broad
range of experts. It should be noted that holistic scales with no explicit
criteria are "content" free and assignment specific. These scales are not
suitable for competency assessments.

Of course, content validity is only one index of validity (Cronbach,
1971; Messick, 1975). Both concurrent or predictive and construct validity
should be examined. The most common method for validating large scale
rating schemes has been to report their correlations with other
writing-related measures including other English grades, reading test
scores, and multiple-choice writing test scores. Many of these "criterion"
variables, however, are even more questionable indicators of writing ability
than the rating scale being validated. A major problem in validating rating
scales is identifying appropriate criterion groups and test scores (Winters,
1978; Quellmalz, Spooner-Smith, Winters & Baker, 1980). A directly related
criterion would be relationships of immediately preceding and subsequent
writing assignment scores. Unfortunately, as different criteria are often
employed in other rating scales and/or in teachers' grading of assignments,
few appropriate direct comparisons are possible.

From the student's viewpoint, this problem raises concerns for
fairness and instructional validity. How closely do the criteria used in the
assessment match those used in the classroom, and how closely do they
represent writing skills for which the student has received instruction?
Fundamental precepts of fairness require that if a system hasn't explicitly
taught the skills, it shouldn't hold the student accountable for being
competent in these skills. For example, originality, humor, and flavor
are desirable features of writing; they are not often taught directly. If
we have no information on the criteria used in holistic scoring, that method
isn't fair; we have no way to determine if what was tested was what was
taught. The legal implications of this dilemma are obvious.

Summary

Balancing ideally detailed analyses of students' writing with the cost
of those analyses is no easy task. School systems and teachers across
the country are wrestling with the problem and arriving at varying solutions.
Some systems don't even try to initiate large scale rating of writing samples.
Some teachers assign little writing and provide cursory or global feedback.
Other systems are willing to pay the price and mount articulated writing
assessment and instructional systems (e.g., Detroit, Los Angeles, Pittsburgh).
Some rating schemes apply explicit, replicable, reasonable criteria;
some scales are silly, some are misapplied, some are downright harmful.

Large scale assessments can devise ways to reduce the costs of
training raters to score large numbers of essays. In an ideally integrated
assessment system, tasks and criteria for the large scale assessment would
be the same as those used in the classroom. A district or state might
construct a scale that referenced basic text components used by
classroom teachers, e.g., main idea, coherence, support, mechanics, and devise
a scoring system where papers were checked off as competent on each skill
and also checked off in more detail on the components falling below mastery.
For example, one text might have competent support and receive a mastery
check; another essay might not and get a check for "details are not
related to the main point," or "details are not concrete."
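
A scoring record of the kind just described could be represented as follows.
This is a hedged sketch rather than an actual district form. The class and
method names are hypothetical, the four skills are those listed as examples
above, and the rule that a paper is competent overall only when no skill is
flagged is an assumption made for illustration.

```python
from dataclasses import dataclass, field

# Basic text components named in the passage above.
SKILLS = ("main idea", "coherence", "support", "mechanics")


@dataclass
class PaperRating:
    paper_id: str
    # Maps each skill rated below mastery to its diagnostic checks.
    below_mastery: dict = field(default_factory=dict)

    def flag(self, skill, check):
        """Record a diagnostic check for a skill that fell below mastery."""
        if skill not in SKILLS:
            raise ValueError(f"unknown skill: {skill}")
        self.below_mastery.setdefault(skill, []).append(check)

    def mastered(self, skill):
        return skill not in self.below_mastery

    def competent(self):
        # Assumed decision rule: competent only if no skill was flagged.
        return not self.below_mastery


rating = PaperRating("essay-17")  # hypothetical paper identifier
rating.flag("support", "details are not concrete")
for skill in SKILLS:
    status = "mastery" if rating.mastered(skill) else rating.below_mastery[skill]
    print(skill, "->", status)
```

A record like this keeps the single mastery decision while preserving the
specific reasons a paper fell short, which is the information teachers,
students, and parents would ask for.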

Systems might allocate the cost of training raters to staff develop- 
ment. All system teachers could be trained in applying the rating cri- 
teria which should promote greater articulation of the formal assessment 
with classroom criteria. Districts such as Detroit find it cost effective 
to pay lay personnel to rate writing samples. Alternately, the system
might ask teachers to swap papers. Teachers could use the rating scale
to score writing of other students in the district in return for having
their students' writing scored by other teachers trained as raters. This
would reduce training costs for district scoring. Many alternative logistics 
could be engineered to spread the time and energy costs efficiently with- 
in existing system resources. 



References 



Baker, E. L., & Quellmalz, E. S. Issues in eliciting writing performance:
Problems in alternative prompting strategies. Paper presented at the
annual meeting of the National Council on Measurement in Education,
Boston, MA, April 1980.

Bourne, L. J. Human conceptual behavior. Boston: Allyn & Bacon, 1966.

Breland, H. M., & Braucher, J. L. Measuring writing ability. Paper
presented at the annual meeting of the American Educational Research
Association, New York, 1977.

Breland, H., & Ragosa, D. Validating placement tests. Paper presented
at the annual meeting of the American Educational Research Association,
San Francisco, 1976.

Britton, J. The composing process and the functions of writing. (Chapter 2)
In Cooper and Odell (Eds.), Research on Composing: Points of departure.
Urbana, IL: National Council of Teachers of English, 1978.

Cazden, C. B. Two paradoxes in the acquisition of language structure and
functions. In K. Connolly & J. S. Bruner (Eds.), The Growth of
Competence. New York: Academic Press, 1974.

Coffman, W. E. Essay exams. In R. L. Thorndike (Ed.), Educational
Measurement (2nd ed.). Washington, D.C.: American Council on Education, 1971.

Conlan, G. Comparison of analytic and holistic scoring techniques. Princeton,
NJ: Educational Testing Service, 1975.

Cooper, C. R. Current studies of writing achievement and writing competence.
Paper presented at the annual meeting of the American Educational Research
Association, San Francisco, 1979.

Cronbach, L. J. Test validation. In R. L. Thorndike (Ed.), Educational
Measurement (2nd ed.). Washington, D.C.: American Council on Education,
1971.

Crowhurst, M. Syntactic complexity in narration and argument at three
grade levels. Canadian Journal of Education, 1980.

Diederich, P. B. Measuring growth in English. Urbana, IL: National
Council of Teachers of English, 1974.



Florio, S. Learning to write in the classroom community: A case study.
Paper presented at the annual meeting of the American Educational
Research Association, San Francisco, 1979.

Follman, J. C., & Anderson, J. A. An investigation of the reliability of
five procedures for grading English themes. Research in the Teaching of
English, 1967, 190-200.

Freedman, S. How characteristics of student essays influence teachers'
evaluation. Journal of Educational Psychology, 1979.

Godshalk, F. I., Swineford, F., & Coffman, W. E. The measurement of
writing ability. New York: College Entrance Examination Board, 1966.

Halliday, M. A. K., & Hasan, R. Cohesion in English. London: Longman,
1976.

Lloyd-Jones, R. Primary trait scoring. In C. R. Cooper & L. Odell (Eds.),
Evaluating writing: Describing, measuring, judging. Urbana, IL:
National Council of Teachers of English, 1977.

Messick, S. A. The standard problem: Meanings and values in measurement
and evaluation. American Psychologist, 1975, pp. 955-966.

Moss, P., Cole, N., & Khampalikit, C. A comparison of direct and indirect
writing assessment methods. Journal of Educational Measurement, in press.

Mullis, J. A. Using the primary trait system for evaluating writing.
National Assessment of Educational Progress, 1979.

National Assessment of Educational Progress. Reading, thinking and
writing: Results from the 1979-80 National Assessment of Reading and
Literature. Denver, Colorado, 1981.

Pitts, M. Relationship of classroom instructional characteristics and
writing in the descriptive/narrative mode. Center for the Study of
Evaluation, University of California, Los Angeles, CA, November, 1978.

Polin, L. Alternative conceptions of the writing skill domain: Problems
for the practitioners. Paper presented at the National Council on
Measurement in Education, Boston, 1980.

Powills, J. A., Bowers, R., & Conlan, G. Holistic essay scoring: An
application of the model for the evaluation of writing ability and the
measurement of growth in writing ability over time. Paper presented at
the annual meeting of the American Educational Research Association,
San Francisco, 1979.



Praeter, D., & Padia, W. Effects of modes of discourse in writing
performance in grades four and six. Paper presented at the annual
meeting of the American Educational Research Association, Boston, 1980.

Quellmalz, E. S. Domain-referenced specifications for writing proficiency.
Paper presented at the annual meeting of the American Educational
Research Association, San Francisco, 1978.

Quellmalz, E. S. Assessing writing proficiency: Designing integrated
multi-level information systems. Paper presented at the annual
meeting of the National Reading Conference, San Diego, CA, 1980.

Quellmalz, E. S. Report on Conejo Valley's Fourth-Grade Writing
Assessment: Fall, 1981.

Quellmalz, E. S., & Capell, F. Defining writing domains: Effects of
discourse and response mode. Report to the National Institute of
Education, November, 1979. (Grant No. OB-NIE-G-78-0213 to the Center for
the Study of Evaluation.)

Quellmalz, E., Spooner-Smith, L. S., Winters, L., & Baker, E.
Characterizations of student writing competence: An investigation of
alternative scoring systems. Paper presented at NCME, April 1980. (Grant
No. OB-NIE-G-78-0213 to the UCLA Center for the Study of Evaluation, 1980.)

Quellmalz, E., Baker, E., & Enright, G. Studies in Test Design: A
Comparison of Modalities of Writing Prompts. Center for the Study of
Evaluation, University of California, Los Angeles, CA, November, 1980.

Resnick, L. What do we mean by meaningful learning? Invited address at
the annual meeting of the American Educational Research Association,
Boston, 1980.

Skinner, B. F. Verbal Behavior. New York: Appleton, 1957.

Scribner, S., & Cole, M. Unpackaging literacy. Social Science Information,
1978, 17, 19-40.

Smith, L. S. Investigation of writing assessment strategies. Report to
the National Institute of Education. Center for the Study of Evaluation,
University of California, Los Angeles, November 1978.

Steele, J. M. The assessment of writing proficiency via qualitative
ratings of writing samples. Paper presented at the annual meeting of
the American Educational Research Association, San Francisco, 1979.

Van Nostrand, A. D. Writing Instruction in the Elementary Grades: Deriving
a model by collaborative research. Providence, RI: Center for the
Research in Writing, 1980.

Veal, L. R., & Tillman, M. Mode of discourse variation in the evaluation
of children's writing. Research in the Teaching of English, 1971, 5,
37-45.

Winters, L. The effects of differing response criteria on the assessment
of writing competence. Report to the National Institute of Education,
November 1978. (Grant No. OB-NIE-G-78-0213 to the Center for the Study
of Evaluation.)



THE MEASUREMENT OF STUDENTS' WRITING PERFORMANCE
IN RELATION TO INSTRUCTIONAL HISTORY 



Marcella Pitts 



Center for the Study of Evaluation 

Graduate School of Education 
University of California, Los Angeles 



The project presented or reported herein was performed pursuant to a
grant from the National Institute of Education, Department of Education.
However, the opinions expressed herein do not necessarily reflect the
position or policy of the National Institute of Education, and no
official endorsement by the National Institute of Education should be
inferred.



THE RELATIONSHIP OF CLASSROOM INSTRUCTIONAL CHARACTERISTICS 
AND WRITING IN THE DESCRIPTIVE/NARRATIVE MODE 

Public concern over an apparent decline in students' writing skills
has prompted educators to examine two central issues: (1) the design of
composition curricula and (2) the valid and reliable assessment of
students' writing performance. This study addressed these two issues by
describing instructional characteristics in a specific curriculum, by
developing and employing an analytic rating scale to evaluate students'
writing performance, and by examining the relationship of instructional
characteristics to writing performance.

This research was exploratory in nature and exhibits limitations 
inherent in exploratory studies such as small sample size. Nevertheless, 
the study provided descriptive information about selected instructional 
characteristics of composition classrooms and, thereby, provided data 
relevant to current concerns about composition curricula. Secondly, 
generalizable procedures for constructing and field testing an analytic
rating scale and for training raters in its use were obtained from this
research and may contribute to our knowledge base in the area of writing
assessment. 

Among recent developments in our efforts to calm public anxiety over
students' writing skills and to discern how best to teach writing and
assess students' writing are the design and implementation of composition
courses incorporated as requirements into the high school curriculum. This
research focused on a typical curriculum change designed to improve writing
skill, a one-semester required composition course developed by a large
urban school district and incorporated into the curriculum at the eleventh
grade.



Briefly, the course provides instruction in four domains of writing:
(1) the sensory/descriptive, (2) the imaginative/narrative, (3) the
practical/informative, and (4) the analytic/expository. The recommended
minimum number of compositions for each of the domains is three, making the
minimal number of completed compositions for the semester 12. Teachers are
encouraged to offer instruction in each of the domains and to include in
their instruction: (1) prewriting and precomposing activities to elicit
ideas from students and to motivate them to write; (2) writing practice to
increase flexibility, fluency, skill, and confidence; (3) reinforcement;
and (4) instruction in grammar as it relates to the writing process.

The course's curriculum outline and these recommended activities were 
valuable resources in the design of two instructional questionnaires, the 
primary data collection instruments in the study. 

Information concerning instructional practices was obtained from
teachers and students for a selected group of variables: (1) communication
of instructional outcomes to students, (2) writing practice, (3) feedback,
(4) instructional time use, and (5) teacher expectation. In addition,
papers previously assigned and graded by the teachers supplied information
about the usual emphases and specificity of correction provided students.

Foremost in the selection of these variables over other
instructionally important dimensions identified in the literature was the fact
that they involve concrete instructional events. The presence, absence,
and frequency of occurrence of these events can be monitored and reported
by teachers and students. This was an important consideration given the
methodology used in the study, which relied heavily on teacher and student
self-report.



Students' writing performance was measured by their combined scores on
two narrative/descriptive writing tasks. An analytic rating scale,
developed for the study and appropriate for the narrative/descriptive mode,
was employed by three high school teachers to rate the writing samples.
The teachers, all of whom had rated essays previously, were trained in the
use of the rating scale.

Sample 

The subjects of the study were the students and teachers in 19
composition classrooms in five high schools in a large urban school district.
The selection of the schools in the sample was based on achievement and
demographic data published annually by the district. These data were used
to develop profiles of individual high schools in the district; the five
schools selected for inclusion in the study had relatively homogeneous
profiles along these dimensions.

The number of classes participating ranged between three and four per
school. Participation was voluntary, with the decision to take part in the
study resting with the individual teachers. Six of the classrooms were
designated by the participating schools as advanced (above average); 11 as
average; and two as skill (below average) classes.

Procedure

Data collection in the five schools took place during the last two 
weeks of May 1978. Visits to each school were scheduled to provide for an
interval of approximately one week between writing assignments. Forty
minutes writing time was allotted for each writing occasion; order of 
topics was counterbalanced by class. 

During this period teachers provided the investigator with a set of
previously graded student expositions. After the writing samples and sets
of graded papers had been collected, students and teachers completed the
instructional questionnaires.

All the essays were returned to the students at the completion of the 
study. Several teachers used the essay as a graded class assignment.

Independent Variables

The independent variables in the study were: (1) communication of
instructional outcomes, (2) use of instructional time, (3) writing
practice, (4) feedback, and (5) teacher expectation. Information related to
each of these variables was collected from teachers and students via
questionnaires. Parallel items pertaining to many of the variables
appeared on both the teacher and student questionnaires.

The first independent variable, communication of instructional
outcomes or intent, was operationally defined as informing students of the
skills they were expected to acquire at the end of the semester. In order
to ascertain the extent to which teachers had successfully communicated
this information, teachers and students were provided with parallel lists
of post-instructional skills and asked to select those which matched most
closely the instructional outcomes in their classrooms. An index which
measured the agreement between the skills selected by students and teachers
was then computed.
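
The text does not give the formula for this index. One plausible form,
offered here purely as an illustration and not as the index actually used in
the study, is the overlap between the skills a teacher selected and the
skills each student selected, averaged over the class. In the sketch below
the function names, the skill labels, and the choice of overlap measure
(shared selections divided by all selections made by either party) are
assumptions.

```python
def selection_agreement(teacher_skills, student_skills):
    """Shared selections divided by all selections made by either party."""
    teacher, student = set(teacher_skills), set(student_skills)
    union = teacher | student
    if not union:
        return 1.0  # neither selected anything, so there is no disagreement
    return len(teacher & student) / len(union)


def class_agreement(teacher_skills, students_skills):
    """Mean teacher-student agreement across the students in one class."""
    if not students_skills:
        raise ValueError("at least one student response is required")
    scores = [selection_agreement(teacher_skills, s) for s in students_skills]
    return sum(scores) / len(scores)


# Hypothetical skill labels and responses for one classroom.
teacher = {"descriptive detail", "narrative order", "mechanics"}
students = [
    {"descriptive detail", "narrative order"},               # 2 of 3 shared
    {"descriptive detail", "narrative order", "mechanics"},  # full agreement
    {"persuasive appeals"},                                  # no overlap
]
index = class_agreement(teacher, students)
print(f"class agreement index: {index:.2f}")
```

An index of this kind rises toward 1 as students identify the same intended
outcomes as their teacher and falls toward 0 as their selections diverge.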

The second independent variable, time on academic content, was
operationally defined as the amount of time spent on: (1) modes of discourse,
(2) writing activities, and (3) features of writing measured by the
analytic rating scale developed for the study. Data for the first and
second dimensions of this variable were purely descriptive. They involved
teachers' estimates of the percentage of instructional time spent on each
of the four domains of writing in the curriculum and teachers' and
students' reports of the activities on which class time was spent (e.g.,
reading composition texts, reading literature, prewriting discussions,
in-class writing).

The third dimension of this variable was measured by questionnaire
items which required teachers and students to indicate the number of class
periods (i.e., 0, 1-5, 6-10, over 10) spent on specific features of writing
related to description, narrative order, and mechanics.

The third independent variable, practice, was defined as the amount of
writing, i.e., frequency and length of assignments written in class and as
homework. Another dimension of the practice variable for which descriptive
data were obtained was the type and number of writing assignments students
completed over the semester. Teachers and students estimated the number of
completed compositions that were expository, descriptive, narrative,
multi-paragraph, single-paragraph, etc.

The fourth variable, feedback, was composed of three dimensions:
immediacy, helpfulness, and instructiveness. The first dimension,
immediacy of feedback, was measured by parallel questionnaire items in which
teachers and students estimated the time period within which students'
papers were usually read and returned. Since the results of previous
research and data from pilot tests had indicated that the most common
methods of providing feedback were individual conferences with students and
written comments on compositions, information was obtained from students
concerning these evaluation practices. Helpfulness of written feedback on
corrected papers and feedback received during individual conferences with
the teacher wefe measured by students 1 ratings. An, item on the student 
questionnaire also provided a measure of the instructiveness of- feedback, 
and additional d<ata related to this dimension were obtained in a separate 
analysis Qf sets of previously assigned and corrected papers provided by 
the teachers. In addition, teacheVs reported on the features on which they 
focused during correction and the usual methods they used to provide 

feedback. - 

a-, 

The fifth independent variable in the study, teacher expectation, was
measured by teachers' recollections of the amount of improvement they had
expected in students' performance at the beginning of the semester.

Dependent Measure

The writing task in the study was closely related to the sensory/
descriptive and imaginative/narrative domains described in the curriculum
outline. The task, primarily but not purely narrative, was structured so
that descriptive detail would be included in the two compositions students
wrote.

Both of the writing assignments used pictures as writing stimuli.
Students were directed to write about the scene in the pictures and the
events which might have preceded and followed it. They were to include in
their essays descriptive detail for readers who would not have an
opportunity to look at the pictures. Further, students were directed to use
the third person point-of-view. They were told that the purpose of the
assignment was to write a story based on the picture and to include
descriptive detail related to setting, characters, and action.

The selection of the pictures and the development of the directions
for the two assignments were based on data from field tests with comparable
students.




A narrative/descriptive analytic rating scale was developed and used
to evaluate students' essays. The features included on the scale were
derived from a survey of theoretical and practical works on descriptive and
narrative writing as modes of discourse and from an examination of
published rating scales. Draft versions of the scale were reviewed by
faculty in the UCLA English department and by staff at the Center for the
Study of Evaluation. The final version of the scale reflected the changes
suggested by the reviewers as well as minor modifications agreed upon by
the investigator and the readers prior to the actual rating of students'
essays.

As a first step in developing the scale, a review of available
theoretical and practical pieces on descriptive and narrative writing was
conducted. This review resulted in the identification of four essential
features of narrative/descriptive writing which appeared to be appropriate
criteria for evaluating relatively short pieces written under timed
conditions: setting, characterization, action, and descriptive detail, the
inclusion of which contributes to the reader's sense of setting, the
characters, and the action or sequence of events.

The selection of these features was supported by a state-of-the-art
review of published analytic rating scales relevant to narrative/
descriptive writing. A review of non-mode-specific features of these and
other prominent analytic rating scales was also undertaken to identify
elements related to mechanics for inclusion in the scale.

Based on this review, a sentence-structure/diction subscale and a
grammar/spelling subscale were developed. The criteria on the first
subscale include fluency and variety of sentence structure and the
selection of clear and specific words and their correct use. The grammar
dimension of the grammar/spelling subscale focuses on reference errors,
tense shifts, punctuation errors, misplaced modifiers, and the like. The
first mode-specific subscale, sequence/coherence, includes criteria related
to temporal order of events, their logical development, and the continuity
with which they are developed. While the criteria for this subscale focus
on the narrative aspects of the writing task, the criteria for the other
mode-specific features, setting and characterization, focus more on the
descriptive aspects of the task.

Students' essays (n = 228) were rated by three trained readers for
each of the features on the narrative/descriptive scale. All the readers
were high school English teachers; all had prior experience in holistic
rating. Approximately two days were spent training the readers and another
one and one-half days were required to read and rate the essays.

Prior to the training and rating sessions, all identifying information
was removed from the students' essays, and each paper was assigned a code
number. All the essays were then typed to facilitate rapid reading and to
remove the confounding effects of handwriting. No corrections of any kind
were made on the typewritten versions. They were duplicates of the
handwritten essays in every respect. In addition to the 228 sample essays,
140 extra essays were prepared in a similar manner for use in rater
training and in calculating inter-rater reliability.

Both days of rater training were full-day sessions. The morning of
the first day was spent reading, discussing, and applying the rating scale
to specially selected training essays chosen to represent the range of
essays in the sample. A procedure was followed in which the readers rated
one, two, or three essays individually and then discussed their ratings.
During the discussions discrepant ratings were examined; elements on the
scale and the terms used to describe them were clarified. A first
inter-rater reliability check was conducted during the afternoon. A
three-way analysis of variance (ANOVA) was used to calculate inter-rater
reliability for the ratings assigned to 40 essays (two essays written by
each of 20 students) read over a one and one-half hour period. The three
factors in the analysis were subjects, topics, and raters. A mixed model
was employed in which subjects and raters were random factors and topics
were fixed.

The second day of training was spent discussing, refining, and
applying the subscales for which the initial reliability coefficients had
been considerably below the .80 level for average ratings chosen as the
test of acceptability. The rating of the 228 essays began when the
inter-rater reliability coefficients for each subscale had increased and
were greater than .80.

The essays were placed in a different random order for each rater so
that the likelihood of all readers rating the same essay at the same point
in time was reduced. However, a common group of essays was included in
each reader's stack of papers so that another reliability check could be
conducted.
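The ordering procedure just described can be illustrated with a short
sketch. It is not the study's actual procedure: the essay identifiers, the
rater labels, the size of the common group, and the function name
build_rater_stacks are all hypothetical, chosen only to show one way of
giving each reader the same essays in an independently shuffled order.

```python
import random


def build_rater_stacks(sample_ids, common_ids, raters, seed=0):
    """Give every rater the same essays in an independently shuffled order.

    sample_ids: identifiers of the essays to be rated (hypothetical codes).
    common_ids: identifiers of the shared reliability-check essays.
    raters:     labels for the readers.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    stacks = {}
    for rater in raters:
        stack = list(sample_ids) + list(common_ids)
        rng.shuffle(stack)  # a fresh shuffle for each rater
        stacks[rater] = stack
    return stacks


# Hypothetical code numbers: 228 sample essays plus a small common group.
sample = [f"E{i:03d}" for i in range(1, 229)]
common = ["C01", "C02", "C03"]
stacks = build_rater_stacks(sample, common, ["rater_1", "rater_2", "rater_3"])
```

Each stack contains identical essays, so ratings can be compared across
readers, while the differing orders reduce the chance that all readers meet
the same essay at the same point in the session.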

The final inter-rater reliability for the 228 essays was quite high,
with coefficients of .88 and .89 for average ratings and .72 and .74 for
single ratings. The total score reliabilities were .94 (average ratings)
and .87 (single ratings).
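The relationship between single-rating and average-rating reliability can
be sketched with the Spearman-Brown formula. This is an illustration only:
the report states that its coefficients came from an analysis of variance,
and the use of Spearman-Brown here, along with the function name, is an
assumption made to show why averaging over three readers raises
reliability.

```python
def spearman_brown(single_rating_reliability, n_raters):
    """Projected reliability of the mean of n_raters ratings.

    Assumes raters are interchangeable (parallel measures), which is an
    assumption of this sketch rather than a claim made in the report.
    """
    r = single_rating_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)


# Single-rating coefficients quoted in the text, projected to three readers.
for r in (0.72, 0.74, 0.87):
    print(f"single = {r:.2f}  projected average of 3 = {spearman_brown(r, 3):.3f}")
```

The projected values fall within about .01 of the reported average-rating
coefficients, which suggests the two sets of figures are consistent with
each other even though the report's own computation was ANOVA-based.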

On the average, student performance was not high or low on any one
subscale. Furthermore, the mean performance of students in classes
designated as above average was consistently higher than the mean
performance of students in average and below average classes. The mean
performance of average level students was, in turn, consistently higher
than the mean performance of students in below average classes. In all
cases writing performance was measured by total writing score for each
student or mean total writing score in each classroom.

Data Analysis and Results 

A two-stage data analysis was performed to examine the relationships
between students' and teachers' reports on classroom use of instructional
variables and the quality of student writing samples.

In the first stage of the analysis, descriptive statistics were
computed on the data from the teacher and student questionnaires. The
research questions addressed were:

1. How are selected instructional variables employed in the
   composition classroom: (a) as reported by students? (b) as reported
   by teachers?

   ° Are intended instructional outcomes communicated to students?

   ° How do teachers allocate instructional time to writing
     activities, modes of discourse, and specific writing skills?

   ° What kind and how much opportunity for writing practice is
     provided?

   ° What is the time interval for feedback, what form does it
     take, and how instructive is the feedback provided students?

2. What expectations do teachers have concerning students' writing
   performance?

In the second stage of the analysis, a series of multiple regressions
was performed to examine the relationships among reported use of the
independent variables and student writing performance. The research
questions addressed were:

1. What is the relationship between students' writing performance
   and use of the four instructional variables: (a) as reported by
   students? (b) as reported by teachers?

2. What is the relationship between students' writing performance,
   use of the four instructional variables, and teachers'
   expectations?

Results of the descriptive analysis of the questionnaire data provide 
a rich base of information concerning instructional practices in the 19 
classes in the sample. 

According to the teachers and students in these classes, the intended
instructional outcomes in the majority of the classrooms were: to write
complete and grammatically correct sentences; to write well-organized
essays; to include supporting detail in essays; to use a consistent point
of view in writing; to follow accepted standards of usage; and to express
ideas in an original way. Students in above average level classes were
most in concert with their teachers regarding the instructional outcomes in
their classrooms.

With respect to the classroom activities designed to achieve these
outcomes, teachers and students in the majority of the classrooms agreed
that the activities were prewriting discussion, in-class writing,
composition analysis, reading literature, and listening to formal lectures
by the teacher. The first three activities are recommended in the course's
curriculum guide.

The guide also suggests that teachers spend approximately equal time
during the semester on each of the four writing domains specified in the
curriculum. In fact, eight of the teachers in the sample indicated they
divided their available instructional time equally among the four domains.
The majority of these teachers taught above average classes.

With respect to the type of assignments completed, the majority of the
teachers and their students agreed that teachers offered one-to-five
assignments for narrative, descriptive, expository, and argumentative
writing over the semester. Also, more than half of the students indicated
they had written one-to-five short stories during the course of the
semester. Teachers of average level classes assigned more grammar
exercises than their counterparts in above average classes, while this
latter group assigned more research papers and multi-paragraph essays.

As might be expected, all the students in the sample wrote in class
more frequently than at home. Teachers of above average classes, however,
had their students engage in writing activities more often, both at home
and in class, than did teachers of average level classes. Moreover, the
in-class essays of above average students were longer than those required
of average level students.

Above average class teachers also spent more time in individual
conferences with students and provided more specific rules and suggestions
for improvement in the written comments on students' papers. Less than half
of the teachers of average classes had individual conferences with their
students to discuss an assignment. They also wrote fewer directive
comments on students' papers and, as might be expected, had a faster
turnaround time for corrected papers.

In an analysis of the comments on previously graded papers provided by
the teachers, the comments on over one quarter of the sets were rated as
highly directive since they included specific rules and suggestions for
students. The majority of these papers were from above average level
classes. Slightly less than one quarter of the sets provided specific
indications of strengths and weaknesses but failed to suggest specific
strategies to improve the paper. A similar percentage contained no
comments; and the comments on the remaining papers, nearly one quarter of
the total, contained only general remarks about the paper and were too
general to be of any instructive value.

In addition, teachers reported they attended to content and mechanics
or organization and mechanics when they corrected students' papers. As
might be expected, the analysis of both the interlinear notations and
comments on the sets of previously graded papers showed that more notations
pertained to mechanics.

The results of the descriptive analysis indicate that important
differences may exist in instruction between competency levels. The
pattern of instruction in above average level classes seemed to rely upon
and extend students' initial writing skills. Students wrote more often,
wrote longer essays, had more individual conferences with their teachers,
and received more instructive feedback than did students in average level
classes. Not only did teachers of average level classes make shorter
assignments, but these included more grammar practice. Grammar exercises
are de-emphasized in above average classes. Thus, the data suggest that
these teachers teach to the competency level of their classes. More
competent classes receive more demanding instruction; less is expected and
asked of less competent groups. Indeed, teachers' expectations concerning
the amount of improvement in students' writing performance over the
semester are positively and significantly correlated with the
school-designated competency or tracking level of the classes.



The powerful influence of tracking level on student performance was
more apparent in the regression analyses performed. Results of the
multiple regressions based on students' reports, teachers' reports, and the
discrepancy between the perceptions of both groups indicated that, for the
variables examined in this study, classroom tracking level is the single
significant variable related to students' writing performance. Thus the
analyses revealed no relationship between reported instructional practices
and students' performance, despite the findings that such practices tend to
vary for classes in different tracking levels.

These results may derive from limitations in the design and scope of
the study and from additional constraints imposed by the curriculum itself.
Nevertheless, they inform further research and invite secondary analysis.

Because of the exploratory nature of this study, the sample size,
using the classroom as the unit of analysis, was extremely small (n = 19).
When the sample was further subdivided into tracking levels, numbers were
even smaller: average classes, n = 11; above average classes, n = 6; below
average classes, n = 2. Within the limitations of the sample size, it was
impossible to examine the relationship between total writing score or
subscale scores and instruction due to the correlation between instruction
and tracking level.

An additional constraint which may have hampered the discovery of
significant relationships lies in the curriculum itself. This course is
only a semester long, and yet the curriculum requires instruction in each
of four writing modes. As reported in the findings above, teachers do in
fact provide instruction in all four modes. Furthermore, there is some
indication that the course is more of a survey of different writing domains
than extensive drill. For example, the curriculum recommends that
a minimum of three assignments be completed in each of the four domains
over the semester. Given that the mission of this course is to provide
students with basic writing competencies if they have not mastered these in
previous classes, this number appears to be quite conservative. The
results showed that teachers provided a moderate amount of writing practice
and a moderate number of writing assignments in contrast to the more
intensive instruction that might be expected in a composition course of
this nature. Perhaps limiting the curriculum to fewer modes of discourse or
expanding the course length to one full year to accommodate all four modes
would strengthen instructional effects. Future studies under such
conditions might uncover relationships that were too weak to appear in the
present study.

Despite these limitations, other methodological features of the study
appear to be promising strategies for studies of this type. First, the
collaborative reports of teachers and students provided a reasonable
indicator of instructional practices. Teachers and students, especially
those in above average classes, were in considerable agreement, and the
collection of survey data from both of these groups seems feasible and
practical, especially at the senior high level. Future work might make use
of more frequent surveys throughout the semester to prevent honest
inaccuracies in recalling information over a long period of time. Studies
should also include direct observation of classrooms. Observation of
ongoing classroom interactions and instructional processes would allow more
precise description of instruction and corroboration of questionnaire data.

A second promising strategy included in this study which could be
incorporated into future work was the examination of teachers' naturally
occurring comments as a way to qualify their self-report data. Other
procedures of qualifying the data provided in this type of study should be
examined as well.

Another product of the study which can be applied in other research
studies is the narrative/descriptive analytic rating scale developed for
the study. Experienced readers with a minimal amount of training can
achieve highly reliable ratings using this scale. Moreover, it appears to
be a valid measure of writing performance given the high correlation
between tracking level and the mean total writing score of students in a
particular classroom.








APPENDIX 







NARRATIVE/DESCRIPTIVE RATING SCALE

I. Sequence/Coherence -- criteria for rating:

1 point    There is no clear temporal (chronological) order to the
           events in the narrative. The reader is not sure which
           event comes first or follows any other event. In fact, a
           sequence may not be related at all. The paper may be
           purely expository or descriptive.

2 points   There is a noticeable beginning and end although the
           temporal order of events may not be clear. Events are
           merely listed rather than progressively and logically
           related to each other. Sentences and paragraphs are
           poorly tied together. There are lapses in coherence; or,
           if transitions are used, they may be used incorrectly or
           repetitiously.

3 points   The temporal order of events is clear. Transitions are
           used correctly. The paper has continuity and there is a
           clear progression of ideas, although there may be minor
           lapses in motivation and logic.

4 points   This paper has all the elements of a "3" paper, with the
           addition of a sense of control from beginning to end. The
           sequencing of events is so well done that the reader has a
           sense of movement. There is a logical progression of
           ideas. Transitions may be expertly used and movement
           facilitated by a variety of transitions. The paper is
           often interesting, original and may include conflict.






II. Setting -- criteria for rating:

1 point    The setting of the narrative is not clear to the reader
           because: (1) the writer does not specify where or when
           the action is taking place; (2) the reader is unable to
           infer the setting from the information included in the
           narrative; the setting is so vague, general and
           unspecific that the reader has no image of time or place.

2 points   The setting of the narrative is apparent to the reader.
           The actual place or time is stated or inferred, but there
           is little or no elaboration.

3 points   The reader has a clear understanding of setting. The
           setting is more explicitly stated than in a "2" paper.
           Details may relate to geographic location, time period, or
           general environment through which the characters move.

4 points   This paper includes all the elements of a "3" paper, with
           the addition of excellent use of detail. The writer uses
           specific detail to describe the setting. The setting is
           so developed that it seems to give the events a "real"
           place in which to happen. The setting is an important
           component in furthering the narrative.

III. Characterization -- criteria for rating:

1 point    Characters are not identified or only barely identified
           by name, noun, or pronoun; or there may be one or more
           adjectives which act like labels. However, there is no
           conscious attempt to develop the characters through their
           speech, actions, reactions to other characters, or other
           characters' reactions to them.

2 points   Compared to a "1" paper, this narrative includes more
           information about one or more characters but this
           information is not elaborated. Details may only be listed,
           not developed. Characters are not clearly established.

3 points   Detail, interpretative comments, specific actions and
           reactions of the characters may be included. One or more
           of the characters may be a stereotype. Character is
           established and a specific direction for development is
           indicated.

4 points   One or more of the characters in the narrative may emerge
           as a unique, attention-getting person. A specific
           character is well-developed through dialogue, action,
           reactions to other characters, or by descriptions and/or
           interpretations of the character's appearance, feelings,
           or thoughts.

IV. Sentence Structure/Diction -- criteria for rating:

1 point    Sentences are garbled, incomplete. Numerous structural
           problems interfere with the reader's comprehension. The
           sentences are not coherent; words are merely strung
           together. Monosyllabic words are used and the vocabulary
           is childish.

2 points   Sentences may be short and choppy or run-on. There may
           be fragments and comma splices. Word choice is limited.
           Words may be used incorrectly, repetitiously and
           inaccurately.

3 points   The sentences read without noticeable breaks and there is
           variety in sentence structure. There may be some sentence
           errors but the paper is fluent. Word choice is exact and
           appropriate although uninspired. There may be several
           cliches and overworked expressions. The paper may be
           stilted or inflated.

4 points   The paper has mature sentences making it easy and pleasing
           to read. It is marked by strong and precise diction.
           Vivid descriptive words which suit the writer's purpose
           are used.



V. Grammar/Spelling -- criteria for rating:

1 point    There are numerous grammatical errors (e.g., agreement,
           pronoun reference, misplaced modifiers, tense shift,
           punctuation) which interfere with the paper's readability.
           The writer seems to have no grasp of basic spelling rules.

2 points   This paper is readable although the grammatical errors are
           distracting. There are several spelling errors in common
           words.

3 points   The paper is basically competent. Errors are noticeable
           but they do not interfere with the writer's message.
           Spelling errors occur in words that are harder to spell.

4 points   This paper has very few or no grammatical or spelling
           errors. The errors that remain make little difference to
           the reader; they are editorial problems and slips.






References 



Braddock, R., Lloyd-Jones, R., & Schoer, L. Research in written
     composition. Urbana, Illinois: National Council of Teachers of
     English, 1963.

Conlan, G. How the essay in the CEEB English test is scored. Princeton,
     N.J.: Educational Testing Service, 1976.

Follman, J. C., & Anderson, J. A. An investigation of the reliability
     of five procedures for grading English themes. Research in the
     Teaching of English, 1967, 190-200.

Hambleton, R., Swaminathan, H., Algina, J., & Coulson, D. Criterion
     referenced testing and measurement: A review of technical issues
     and developments. Review of Educational Research, 1977, 48(1).

Harris, C. W. Some technical characteristics of mastery tests. In
     C. W. Harris, M. C. Alkin, & W. J. Popham (Eds.), Problems in
     criterion-referenced testing. CSE Monograph Series in Evaluation,
     No. 3. Los Angeles: Center for the Study of Evaluation, University
     of California, 1974.

Howerton, M. C., Jacobson, M., & Eldon, R. The relationship between
     quantitative and qualitative measures of writing skill. Paper
     presented at the Annual Meeting of the American Educational
     Research Association, New York, April 1977.

Hunt, K. W. Early blooming and late blooming syntactic structures. In
     C. Cooper & L. Odell (Eds.), Evaluating writing. State University
     of New York at Buffalo, 1977.

Lloyd-Jones, R. Primary trait scoring. In C. Cooper & L. Odell (Eds.),
     Evaluating writing: Describing, measuring, judging. Urbana, Ill.:
     National Council of Teachers of English, 1977.

Millman, J. Criterion-referenced measurement. In W. J. Popham (Ed.),
     Evaluation in education: Current applications. Berkeley, Calif.:
     McCutchan, 1974.

O'Hare, F. Sentence combining: Improving student writing without formal
     grammar instruction. NCTE Research Report No. 15. Urbana, Ill.:
     National Council of Teachers of English, 1973.



MEASURES OF HIGH SCHOOL STUDENTS' EXPOSITORY WRITING: 
DIRECT AND INDIRECT STRATEGIES 

Laura Spooner-Smith




Center for the Study of Evaluation 
UCLA Graduate School of Education 
Los Angeles, California 90024 







The project presented or reported herein was performed pursuant to a
grant from the National Institute of Education, Department of Education.
However, the opinions expressed herein do not necessarily reflect the
position or policy of the National Institute of Education, and no
official endorsement by the National Institute of Education should be
inferred.






ABSTRACT 

As demand increases for competency-based tests of students' basic
academic skills, additional requirements for measures of writing
proficiency also have surfaced. The need today is for measures of writing
that not only are technically sound, but which also serve as meaningful,
efficient indicators of clearly defined writing competencies.
Additionally, the demand is for measures that carry clear implications for
instructional planning.






MEASURES OF HIGH SCHOOL STUDENTS' EXPOSITORY WRITING:
DIRECT AND INDIRECT STRATEGIES 

As demand increases for competency-based tests of students' basic
academic skills, additional requirements for measures of writing
proficiency also have surfaced. The need today is for measures of writing
that not only are technically sound, but which also serve as meaningful,
efficient indicators of clearly defined writing competencies.
Additionally, the demand is for measures that carry clear implications for
instructional planning.

The present study was undertaken in an effort to examine
relationships among writing assessment strategies which are potentially
responsive to requirements of competency-based testing. Three alternate
strategies to measure secondary students' expository writing were
developed and administered. Two of the strategies, direct measures,
involved collecting and rating students' writing samples. The distinction
between the direct measure strategies lay in the response criteria
by which the samples were judged. One form of criteria, an Analytic
rating scale, required raters to assign scores to six different
characteristics of the writing samples. The other form of criteria, an
Impressionistic rating scale, yielded a single score on the quality of
each essay as an example of exposition. The third strategy, an indirect
measure, was an objective test of writing-related competencies derived
from the Analytic rating scale.






Background 

The measurement of writing and writing-related competencies
historically has presented unique technical and validation problems. From
pioneer assessment efforts to current composition tests, measurement
experts have contended with such recurring problems as fluctuating test
reliability, implementation of efficient scoring procedures, and
examining the validity of indirect, i.e., objective, measures of writing.

In recent years, additional requirements for measures of writing
have arisen along with increasing demand for competency-based tests of
basic skills. This situation has prompted renewed attention to a
fundamental concern in writing assessment: how to identify and define
measurement tasks that will serve as efficient and instructionally
meaningful indicators of writing competence. Identifying test tasks (or
items) to measure a domain of learning is, of course, an intellectually
difficult aspect of competency-based assessment, especially for complex
behaviors such as the production of written discourse. Millman's (1974)
assertion that a performance domain should be defined ". . . by those
facets and elements that make a difference in how the learner responds"
leads naturally to the question, what facets make a difference?

The Response Criteria Issue 

In the measurement of writing skills, one "facet" that clearly
makes a difference is the criteria by which writing samples are judged.
Although a variety of guided scoring procedures for sorting or ranking
written pieces currently are in use, most can be classified as either
"analytic" or "holistic."






Analytic scoring procedures presume that a piece of writing can be
viewed as consisting of component, but not necessarily independent,
parts which are worthy of individual scrutiny. The procedures require
raters to assign points to each of several specified aspects of a
composition and yield an estimate of overall quality of the writing
product, as well as sub-scores on separate elements of the writing sample.
Analytic rating procedures aptly meet the requirements of
competency-based assessment programs which call for information on
specific skill strengths and deficits.
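A brief sketch may make the analytic approach concrete. It is not the
rating scale used in this study: the class name AnalyticRating, the
subscale names, the 1-to-4 point range, and the deficit threshold are all
hypothetical, chosen only to show how per-feature points yield both a
total score and information on specific strengths and deficits.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AnalyticRating:
    """One rater's points for one essay, keyed by (hypothetical) subscale."""

    subscores: Dict[str, int]
    min_points: int = 1  # assumed lowest point value on each subscale
    max_points: int = 4  # assumed highest point value on each subscale

    def __post_init__(self) -> None:
        # Reject points outside the assumed range for any subscale.
        for name, points in self.subscores.items():
            if not self.min_points <= points <= self.max_points:
                raise ValueError(f"{name}: {points} is outside the scale range")

    def total(self) -> int:
        """Overall quality estimate: the sum of the subscale points."""
        return sum(self.subscores.values())

    def deficits(self, threshold: int = 2) -> List[str]:
        """Subscales at or below the threshold, in alphabetical order."""
        return sorted(n for n, p in self.subscores.items() if p <= threshold)


# Illustrative subscale names only; the study's six characteristics differ.
rating = AnalyticRating({"organization": 3, "support": 2, "mechanics": 4})
overall = rating.total()
weak_areas = rating.deficits()
```

A holistic or impressionistic rating, by contrast, would be represented by
a single number per essay, with no subscale detail from which to diagnose
particular deficits.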

In contrast, holistic rating procedures assume that ". . . each
factor that makes up writing skill is related to all other factors and
that one factor cannot be easily separated from others" (Office of the
Los Angeles County Superintendent of Schools, 1977). Under holistic
procedures, raters assign a single value to a piece of writing. A
specific type of holistic procedure, impressionistic rating scales, also
requires assignment of a single value to a written product. Unlike most
holistic approaches, impressionistic scales involve only a minimal
rubric to guide the judgments of raters. The apparent advantage of
holistic rating procedures in general, and impressionistic scales in
particular, is that they tend to be an efficient and reliable direct
measure of writing performance. The drawback is the lack of precise
information on the attributes of writing to which raters attend. Whether
analytic and holistic procedures provide substantially comparable
estimates of writing competence is an important question, especially within
the context of competency-based assessment.




The Response Mode Issue 

Another "facet" in the measurement of writing skills is the form of
response elicited from examinees. Indeed, an important issue in writing
assessment centers on the relative merits of two contending response
modes: production of a piece of writing, i.e., "writing samples," and
selecting a response from among given alternatives, i.e., "objective"
tests of writing.

That there is any question regarding the most appropriate approach
for assessing composition skills at first may appear puzzling. Clearly,
writing samples represent a direct measure of writing performance and
therefore possess prima facie validity. This, in fact, is the principal
argument supplied by proponents of writing samples as a means of composi-
tion assessment. Those who favor more indirect methods, i.e., objective
tests, however, have pointed to the usual unreliability of writing
samples, notably the difficulty of ensuring reliable scoring procedures
and the tendency of a writer's performance to fluctuate in quality over
time and task. Problems of unreliability in writing samples have been
the subject of recent research which has resulted in promising procedures
aimed at enhancing measurement reliability. Nonetheless, a clear advan-
tage of objective methods of assessment is the efficiency with which the
measures can be administered and scored, an economy of special importance
when sizable aggregates of students are to be tested.

Arguments of efficiency, though, have failed to impress many profes-
sionals in the discipline of English, who have voiced concern over the
inherent lack of content validity of indirect approaches. In response
to such objections, proponents of indirect measures have cited findings
from studies (e.g., Godshalk et al., 1966) which reveal statistically
significant correlations between performance on objective tests of
writing and performance on actual writing samples.

The limitation of many correlational and predictive studies of
objective tests of writing for competency-based assessment, however,
centers on the criterion against which the items are validated, notably
the use of holistic scoring procedures to rate the criterion essays.
Recall that holistic scoring results in a single score on the overall
quality of an essay; separate scores on different characteristics of an
essay are not provided. Consequently, no inferences can be drawn about
possible relationships between the classes of skill tapped by objective
items and the classes of skill exhibited in the criterion essay. At
best, one can conclude that various combinations of skill measured by
objective items are related in some unspecified fashion to actual writing
performance. What appears to be needed is information on the relation-
ships between well-defined classes of objective items and equally well-
defined writing production measures.

Method

Subjects, 128 eleventh and twelfth grade students in six English
classes in the Los Angeles area, were randomly assigned within each
class to treatment groups determined by the order in which the measures
were administered. Table 1 depicts the three measurement strategies.
Each subject wrote two essays of at least ?00 words on topics designed
to elicit expository writing and completed an objective test of writing-
related competencies. Two raters were trained to employ an Analytic
rating scale and two to use an Impressionistic rating scale. The writing
samples were scored by both rater pairs, resulting in four total scores
for each sample. Final study reliability of ratings on the Analytic
total scale was .90, with rater reliabilities for the six Analytic sub-
scales ranging from .84 to .95. Rater reliability for the Impressionistic
scale was .87.
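
The report does not state which coefficient was used for rater reliability. As a minimal sketch, assuming it was the Pearson correlation between two raters' scores on the same essays, the computation might look like the following. The function name and the ratings are hypothetical illustrations, not study data.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares of the deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary within each list")
    return cov / sqrt(var_x * var_y)

# Hypothetical Impressionistic ratings (1-6) from two raters on six essays.
rater_a = [2, 4, 5, 3, 6, 1]
rater_b = [3, 4, 5, 2, 6, 1]
print(round(pearson_r(rater_a, rater_b), 2))
```

Other agreement indices (for example, percent exact agreement) would give different values for the same ratings, so the choice of coefficient matters when comparing reliabilities across scales.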

Measures 

The direct measure writing task employed in the study consisted of
two major components: the writing topics (and directions) and two forms
of rating criteria. The topics, designed to elicit like-samples of
student writing, were intended to promote writing within the discourse
domain of exposition, that is, writing ". . . that explains or clarifies a
subject" (Brooks & Warren, 1961). Task attributes guiding development of
the topics included discourse mode, rhetorical purpose, content limits,
and intended audience.

One form of rating criteria, the Analytic scale, was designed to
reflect state-of-the-art pedagogical precepts and practice in composi-
tion. Elements on the scale were derived from an analysis of conven-
tionally recognized structural features of exposition as indicated in
curriculum guides and textbooks at the secondary level. The final ver-
sion of the Analytic scale yielded scores on six subscales corresponding
to the following elements of writing: essay focus (main idea), organiza-
tion, development, support, paragraphing, and mechanics. The range of
points for each of the Analytic subscales was four (high) to one (low).
The rating rubric contained descriptions of essay characteristics for
each of the four levels for all six subscales. Table 2 presents an
abbreviated version of the Analytic scale.
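
The scoring structure described above can be sketched as a small record type. This is an illustration only: the class name is hypothetical, and the total is assumed to be the unweighted sum of the six subscales, which the report does not state explicitly.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class AnalyticRating:
    """One rater's scores on the six Analytic subscales (1 = low, 4 = high)."""
    focus: int
    organization: int
    development: int
    support: int
    paragraphing: int
    mechanics: int

    def __post_init__(self):
        # Reject any score outside the four-point range of the rubric.
        for f in fields(self):
            value = getattr(self, f.name)
            if value not in (1, 2, 3, 4):
                raise ValueError(f"{f.name} must be 1-4, got {value}")

    def total(self) -> int:
        # Assumption: total score = unweighted sum of the six subscales.
        return sum(getattr(self, f.name) for f in fields(self))

rating = AnalyticRating(focus=3, organization=4, development=3,
                        support=2, paragraphing=4, mechanics=3)
print(rating.total())  # 19
```

Under this assumption, totals range from 6 to 24 for a single rater.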




The second form of response criteria, the Impressionistic rating
scale, required raters to make a judgment regarding the overall quality
of each writing sample as an example of effective exposition. The rating
rubric directed raters to assign each essay a single numerical score by
employing a six-point scale, with six (high) and one (low). In addition
to the scale, the rubric contained several definitions outlining prominent
conventionally recognized features of exposition.

Additionally, an indirect measure, a 37-item multiple-choice test,
was developed to measure skills presumably related to actual production
of expository writing. The skills covered in the test were identified
through two related analyses. The first analysis consisted of a review
of expository writing skills frequently emphasized in secondary composi-
tion curricula and instructional materials for which selected-response
type practice was provided. For example, a typical exercise required
students to identify from a list those details which either do or do not
support a given generalization. The skills identified through the review
were then arrayed against elements listed in the Analytic rating scale in
order to determine which skills were conceptually analogous to the rating
scale elements.

The final version of the Objective measure contained a subtest for
each of the following Analytic scale elements: focus (main idea),
development, organization, support, paragraphing. Each of the five
subtests contained five similar-format items which were generated accord-
ing to a set of test item specifications. Objectives for the subtests
are presented in Table 3. The method by which items for a sixth subtest
were generated was different from that of the first five subtests.
Unlike the stimulus passages of the first five subtests, which were
generated specifically for the objective test, the passages in subtest
six were drawn from actual samples of students' writing. The passages
selected exhibited one or more of several types of errors, e.g.,
failure to state or imply the main idea, lack of supporting statements.
Each passage was followed by four items directing the student to identify
the statement(s) which exhibited a specified category of error.

Summary of Findings 

Relationships Among Analytic Scale 
and Impressionistic Scale Scores 

Correlations among scores from the two rating scales are presented
in Table 4. As shown, scores on the six subscales comprising the Analytic
scale proved to be highly related, with correlations ranging from .69 to
.90. Correlations between subscale scores and the Analytic scale total
scores ranged from .82 to .96.

The correlation between the Impressionistic scale scores and Analytic
total scale scores (.81) indicated a strong association between the two
rating strategies. The association extended to relationships between the
six Analytic subscales and Impressionistic ratings, with correlations
ranging from .65 to .80.

To further examine the relationship between Impressionistic and
Analytic scores, Impressionistic scores were regressed on the six Analytic
subscales, which jointly accounted for approximately 75% of the variation
(F = 53.?36, df = 6,105, p < .01) in Impressionistic scores. Two Analytic
subscales, Mechanics (F = 30.789) and Support (F = 18.365), proved to be
significant predictors of Impressionistic scale scores in the model with
all six subscales (see Attachment 7).
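
As a rough consistency check on the figures above, the overall F statistic of a regression can be recovered from the proportion of variance explained and the degrees of freedom, using the standard relation F = (R²/k) / ((1 − R²)/df_residual). The sketch below applies it to the rounded R² of .75; the function name is illustrative and not part of the original analysis.

```python
def overall_f(r_squared, n_predictors, df_residual):
    """Overall F statistic for a regression, from R^2 and degrees of freedom."""
    if not 0 <= r_squared < 1:
        raise ValueError("r_squared must be in [0, 1)")
    explained = r_squared / n_predictors        # variance explained per predictor
    unexplained = (1 - r_squared) / df_residual  # residual variance per residual df
    return explained / unexplained

# R^2 = .75 with df = 6,105, as reported for the six-subscale model.
print(round(overall_f(0.75, 6, 105), 1))  # 52.5
```

The result is close to, but not identical with, the reported F. A gap of this size is what one would expect when R² is rounded to two decimals before the calculation.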




The relatively strong associations among the Analytic subscales
suggested that the relative importance of the subscales as predictors
may be masked in the regression analyses. To examine this, a new compo-
site variable, Structure, which was comprised of the sum of scores on
four Analytic subscales (Organization, Focus, Paragraphing, Development),
was entered into the equation. The combination of the three subscales
Mechanics, Support, and Structure accounted for 74% of the variation in
Impressionistic scale scores (F = 105.159, df = 3,108, p < .01).
Again, Support (F = 28.502) and Mechanics (F = 31.468) emerged as sig-
nificant predictors of Impressionistic scores (see Table 5).
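
The construction of the composite can be sketched as follows. The dictionary keys, function names, and example scores are hypothetical; only the rule that Structure is the sum of the four named subscales comes from the text.

```python
# The four subscales the text combines into the Structure composite.
STRUCTURE_SUBSCALES = ("organization", "focus", "paragraphing", "development")

def structure_score(subscales):
    """Sum the four structural subscale scores into one composite."""
    return sum(subscales[name] for name in STRUCTURE_SUBSCALES)

def reduced_model_predictors(subscales):
    """Return the three predictors entered in the reduced regression."""
    return {
        "mechanics": subscales["mechanics"],
        "support": subscales["support"],
        "structure": structure_score(subscales),
    }

# Hypothetical Analytic ratings for one essay (each subscale scored 1-4).
essay = {"focus": 3, "organization": 4, "development": 3,
         "support": 2, "paragraphing": 4, "mechanics": 3}
print(reduced_model_predictors(essay))
```

Collapsing the four highly intercorrelated subscales into one variable reduces the overlap among predictors, which is the reason the text gives for introducing the composite.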

Relationships Among Rating Scale 
Scores and the Objective Measure 

The correlation between the Objective measure total scores and the
Analytic scale total scores was .61, while the correlation with Impres-
sionistic scores was .65 (see Table 6). Correlations between the
six Objective subtest scores and total scores on the two rating scales
ranged from .23 to .55.

To examine the predictive relationship between the Objective mea-
sure and the two rating scales, the Impressionistic and Analytic scale
total scores were independently regressed on the six major Objective
subtests (see Table 7). Results of the regression analyses for the
Impressionistic scale (F = 12.853, df = 6,92, p < .01) and for the
Analytic scale (F = 10.338, df = 6,92, p < .01) indicated that the six
Objective subtests jointly accounted for approximately 46% of the vari-
ation in Impressionistic scale scores and for approximately 40% of the
variation in Analytic scale scores. Two Objective subtests, Paragraphing
(Subtest 4) and Paragraph Analysis (Subtest 6), proved to be significant
predictors of both Impressionistic and Analytic scale total scores.


In an additional series of analyses, Analytic subscales were
regressed on analogous groups of Objective items. These calculations
(reported in Table 8) resulted in significant F ratios for eight
of the nine groups of Objective items.

Discussion 

An interesting finding to emerge from the study concerned the
pattern of strong relationships across the Analytic subscales. On the
basis of these data alone, it is tempting to infer that the Analytic
scale actually tapped a single unitary dimension of writing. Such an
inference, however, overlooks an important facet of the writing task
which may have affected the results: the writing topic and directions.

The topics and directions of the study were specifically designed
to elicit writing developed through a logically arranged structure of
generalizations supported by specifics. Directions accompanying the
topics prompted students in the following way:

Remember, the purpose of your essay is to give an informative
explanation. . . . Back up your ideas with specific support, such as
examples, facts, and other details. Make sure your essay is
well-organized.

As anticipated, the topics and directions promoted uniformity in the
rhetorical structures of the majority of students' writing samples, of
course, with varying degrees in quality of execution. Whether diffe-
rential subscale scores will emerge when writing samples display more
varied structural patterns is an area worthy of additional inquiry.

It is also possible that Impressionistic rating scores were indi-
rectly a result of the relative uniformity of the structural character-
istics of the essays. Once Impressionistic raters became habituated to
the structural patterns of the majority of writing samples (through
practice and training), they may have attended to a few prominent fea-
tures, other than structure, which most noticeably discriminated among
the essays.

The preceding notion is supported by the emergence of the two
Analytic subscales, Mechanics and Support, as predictors of Impression-
istic ratings. Judgments regarding a writer's command over mechanical
aspects of writing can be made independently of the overall structure of
a written product. Similarly, judgments regarding adequacy of support
for generalizations embedded within an essay can be made without refer-
ence to the overall structure of a written piece. Given the relative
conformity of structural patterns of the students' essays, Analytic and
Impressionistic raters may have inadvertently attended to two readily
discernible, and thus discriminating, features of the writing samples:
mechanics and support.

This interpretation of statistical relations between the two rating
strategies is not meant to imply that the discourse domain of exposition
consists exclusively of two components, mechanics and support. The high
correlations between Analytic scores and Impressionistic scores suggest
that the Analytic subscales did, in fact, represent recognizable, if not
necessarily independent, features of exposition. A provocative issue
presents itself: Was the relative lack of independence among the Analy-
tic subscales (and the high correlation with Impressionistic scores) a
function of the homogeneity of student responses? Or are features of
writing such as development, organization, and main idea actually
inseparable for purposes of rating?

Not too surprising were findings which revealed moderately positive
relationships between Objective measure scores and those yielded by the
two essay rating strategies. The positive relationship was expected, as
the Objective items were designed to assess, at the levels of recognition
and discrimination, those categories of skill measured by the rating
strategies at the level of production. Moreover, previous studies have
demonstrated that reasonably well-designed objective tests of writing
invariably correlate with scores on writing samples, a phenomenon which,
commonsensically, can be accounted for by the global constructs of
language (reading) ability or verbal aptitude.

Of more compelling interest to competency-based test developers
were the patterns of student performance on the Objective measure.
Given that reading ability was likely to affect test performance, the
majority of items were developed with the aim of minimizing reading
difficulty, e.g., avoiding complex constructions, abstract content,
advanced vocabulary. In fact, most of the items were designed to require
students to make discriminations among individual sentences, rather than
sentences embedded within prose passages.

The relatively high mean performance (see Table 9) of students
on items requiring discrimination among individual sentences suggested
that many of the students possessed the writing-related competencies
being measured, such as selecting details to support generalizations,
arranging given ideas in a logical order, choosing statements to develop
a given main idea. For many lower-ability students (as indicated by
teacher ratings), though, these competencies were not expressed when
stimulus competition within the task (e.g., number of words and sen-
tences, sentence structure) was increased.

Especially worthy of consideration are properties of items within
the two subtests (Subtests 4 and 6) which proved to be significant
predictors of the essay-rating total scores. Both of these subtests
required students to make discriminations among statements embedded
within prose passages. Here again, the construct of verbal (or reading)
ability is a convenient, but hardly satisfying, explanation for a
statistical artifact.

The preceding discussion has highlighted some of the technical
problems and issues associated with the measurement of writing and
writing-related skills, but what about practical implications? Findings
indicated that the two rating scales provided essentially comparable
estimates of writing competence. By definition, analytic scales, such
as the one employed, have a greater potential than impressionistic (or
holistic) scales to paint a clear picture of students' writing strengths
and deficits. As expected, the time required to train Analytic raters
and to analytically rate the essays was slightly greater than the time
required for Impressionistic procedures (approximately six hours). The
expense, however, is likely to be outweighed by the usefulness of the
information yielded by analytic procedures. The explicit nature of
analytic scales provides instructional decision makers and students with
clear information on the domain of learning being measured. Such infor-
mation is likely to enhance dialogue among teachers, students, and
administrators on the status of student writing and may serve as a basis
for instructional planning, diagnosis, and remediation.

The contribution of indirect measures of writing, however, raises a
variety of issues regarding further study. One of the most basic issues
centers on the nature of the relationship between direct and indirect
measurement strategies. As demonstrated in the present study, as well
as others, there exists an array of selected response tasks (beyond
those measuring sentence level skills and writing mechanics) which are
conceptually and statistically associated with writing production. If
selected response tasks are to be useful within the context of competency-
based measurement, though, test developers must employ test items which
are positively related to instructional efforts.

Eighteen years ago, Richard Braddock characterized research on
composition as ". . . laced with dreams, prejudices, and makeshift
operations" (Braddock et al., 1963). It probably is fair to say that
the state of composition research has advanced, even accelerated, in
recent years. This may be due in large part to growing acceptance of a
research paradigm which views measurement and instruction as complemen-
tary pursuits. A continuing challenge to test developers, then, is to
identify measurement tasks which, when practiced under appropriate
instructional conditions, are likely to promote, not simply predict,
acquisition of writing production skills.




REFERENCES 



Braddock, R., Lloyd-Jones, R., & Schoer, L. Research in written compo-
sition. Champaign, Ill.: National Council of Teachers of English,
1963.

Brooks, C., & Warren, R. Modern rhetoric, shorter edition. New York:
Harcourt, Brace & World, 1961.

Godshalk, F. I., Swineford, F., & Coffman, W. E. The measurement of
writing ability. New York: College Entrance Examination Board,
1966.

Millman, J. Criterion-referenced measurement. In W. J. Popham (Ed.),
Evaluation in education: Current applications. Berkeley, Calif.:
McCutchan, 1974.

Office of the Los Angeles County Superintendent of Schools. A common
ground for assessing competencies in written expression, review
copy. Los Angeles: Division of Curriculum and Instructional
Services, 1977.








TABLE 1

Measurement Strategies

Type of Strategy

Direct Measure of Writing            Indirect Measure of Writing

1. Writing samples judged by         3. Objective items requiring
   Analytic scoring criteria.           students to discriminate
                                        among given passages of
2. Writing samples judged by            written discourse.
   Impressionistic scoring
   criteria.








TABLE 2

Analytic Rating Scale (Abbreviated)

THE ANALYTIC RATING SCALE

Analytic Scale Elements

Essay Focus: The introduction or conclusion of the essay clearly
indicates the subject and main idea of the whole essay.

Essay Development: All major subtopics ("main points") clearly
relate to the main idea of the whole essay.

Essay Organization: The main idea is developed according to a
clearly discernible method of organization.

Support: Generalizations and assertions are supported by specific,
clear supporting statements.

Paragraphing: The essay is composed of one or more clearly discernible
units of thought, e.g., paragraphs.

Mechanics: The essay is free of intrusive mechanical errors.






TABLE 2 (con't.)
Analytic Rating Scale: Sample Rubric

ELEMENT 1
Essay Focus

The introduction (if deductively structured) or conclusion (if inductively
structured) of the essay clearly indicates the subject and main idea of
the whole essay.

4. The introduction (and/or conclusion) of this paper clearly conveys the
   main idea of the whole essay. It also limits the topic by alerting
   the reader to the key points covered in the body of the essay.

   Specifically, in the introduction (and/or conclusion):

   a. The subject of the essay is clearly identified.

   b. The main idea of the whole essay is clearly stated or implied.

   c. The topic is clearly limited. That is, key points (e.g., reasons,
      ideas) or major line(s) of reasoning treated in the essay are
      identified or summarized.

3. The introduction (and/or conclusion) of this paper conveys the main
   idea of the whole essay. It sets limits on the topic, but does not
   clearly suggest how the main idea is developed.

   Specifically, in the introduction (and/or conclusion):

   a. The subject of the essay is clearly identified.

   b. The main idea of the whole essay is clearly stated or implied.

   c. An attempt is made to limit the topic. That is, the number, or
      type, of key points is specified, but there is not clear refer-
      ence to the substantive issues treated in the body of the essay.

2. The introduction (and/or conclusion) of this paper gives the reader a
   fairly clear sense of the main idea of the whole essay. However,
   neither the introduction nor the conclusion helps focus, or bring
   direction to, the body of the paper.

   Specifically, in the introduction (and/or conclusion):

   a. The subject of the essay is identified.

   b. The main idea of the whole essay is stated or implied.

   c. No attempt is made to limit the topic.






TABLE 2 (con't.)

ELEMENT 1 (continued)

Essay Focus

1. Neither the introduction nor the conclusion is helpful to the reader
   in obtaining any sense of the main idea of the essay.

   Specifically, in the introduction (and/or conclusion):

   a. The subject of the essay is not clearly identified or there is
      no reference to the subject.

   b. The main idea of the whole essay is not clearly stated or implied:
      there is no reference to the main idea, or the reference is con-
      fusing.






TABLE 3

Objective Measure Subtest Objectives (Abbreviated)

I. Objective

(5 items)  The student will be given a brief paragraph which is lacking
           either a topic or concluding sentence, and four alternate
           sentences. The student is to select the sentence which
           would serve as the most appropriate topic or concluding sen-
           tence, i.e., a statement of the paragraph's main idea.

II. Objective

(5 items)  The student will be given a main idea statement for a multi-
           paragraph expository essay and alternate statements that
           might be included in the body of the essay. The student is
           to select the statement that most directly contributes to
           development of the given main idea.

III. Objective

(5 items)  The student will be given a topic sentence for an expository
           paragraph and alternate statements that might be included in
           the paragraph as supporting detail. The student is to select
           the statement that does not provide specific support for the
           given topic sentence.

IV. Objective

(5 items)  The student will be given a series of five to six lettered sen-
           tences which express two distinct thoughts (i.e., subtopics)
           which are related to the same overall main idea. The student
           is to indicate where one complete thought ends and another
           begins, i.e., where a new paragraph could logically begin.

V. Objective

(5 items)  The student will be given five sentences which could be included
           in an expository essay: one statement of the essay's main idea;
           two "subtopic" sentences ("topic" or "concluding" sentences for
           individual paragraphs within the essay); two supporting details.
           The sentences will be given in scrambled order. The student is
           to indicate a logical order for the sentences.






TABLE 4
Regression of Impressionistic Scale on Analytic Scale

REGRESSION OF IMPRESSIONISTIC SCALE SCORES ON
SIX ANALYTIC SUBSCALES

Analytic        Unstandardized   Standard      Standardized
Subscales       Coefficient      Error of B    Coefficient     F

Mechanics       .585             .106          .424            30.789*
Focus           .233             .147          .196             2.537
Development     .046             .160          .040              .082
Organization    .293             .182          .265             2.492
Support         .514             .120          .424            18.365*
Paragraphing    .033             .148          .028              .048

df = 1,105
* = p < .01
r² = .75
R = .87
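
Each F in the table above has one numerator degree of freedom, so it should equal the squared ratio of the unstandardized coefficient to its standard error (that is, the square of the t statistic). A minimal check using the Mechanics and Support rows follows. The function name is illustrative, and small discrepancies from the printed F values are expected because the coefficients are rounded to three decimals.

```python
def f_from_coefficient(b, se_b):
    """F (1 numerator df) for a single coefficient: the squared t ratio."""
    if se_b <= 0:
        raise ValueError("standard error must be positive")
    return (b / se_b) ** 2

# (B, standard error of B, reported F) for two rows of the table.
rows = {"Mechanics": (0.585, 0.106, 30.789),
        "Support": (0.514, 0.120, 18.365)}
for name, (b, se_b, reported) in rows.items():
    print(name, round(f_from_coefficient(b, se_b), 2), reported)
```

The same check can be used to spot table cells that were misread in a scanned copy: a printed F that is far from (B / SE)² suggests that one of the three values in the row was garbled.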



TABLE 5

REGRESSION OF IMPRESSIONISTIC SCALE SCORES ON
THREE ANALYTIC SUBSCALES

Analytic        Unstandardized   Standard      Standardized
Subscales       Coefficient      Error of B    Coefficient     F

Mechanics       .582             .104          .422            31.468*
Support         .560             .105          .462            28.502*
Structure       .017             .028          .056              .370

df = 1,108
* = p < .01
Structure = Composite score of Organization, Focus, Development, Para-
graphing
r² = .74
R = .86







TABLE 6

CORRELATIONS AMONG ANALYTIC AND IMPRESSIONISTIC TOTAL SCORES AND
OBJECTIVE MEASURE SCORES

Variable                  Imp.   An.    I      II     III    IV     V      VI     A      B      C      D      Total

Impressionistic Total     1.00
Analytic Total            .81    1.00

Objective Subtests
  I. Main Idea            .24    .23    1.00
  II. Development         .46    .37    .17    1.00
  III. Supt. Detail       .35    .35    .??    .35    1.00
  IV. Paragraph           .55    .49    .29    .43    .37    1.00
  V. Organization         .45    .42    .26    .38    .32    .39    1.00
  VI. Para. Analysis      .53    .53    .36    .47    .35    .43    .43    1.00
    A. Main Idea 1        .30    .35    .24    .36    .12    .23    .12    .72    1.00
    B. Development 1      .40    .40    .26    .33    .30    .35    .37    .74    .40    1.00
    C. Organization 1     .39    .33    .32    .35    .35    .32    .33    .75    .44    .34    1.00
    D. Supt. Detail 1     .47    .51    .21    .33    .24    .33    .42    .71    .30    .30    .44    1.00
Objective Total Test      .65    .61    .53    .66    .55    .60    .68    .?5    .54    .66    .67    .63    1.00

1 = Items within Paragraph Analysis



TABLE 7
Regression of Rating Scales on Objective Measure

REGRESSION OF ANALYTIC SCALE TOTAL ON SIX
OBJECTIVE MEASURE SUBTESTS

Objective Measure     Unstandardized   Standard      Standardized
Subtests              Coefficient      Error of B    Coefficient     F

Main Idea              .141            1.300         .009              .012
Development            .428            1.595         .026              .072
Supporting Detail     1.502            1.659         .082              .811
Paragraphing          4.037            1.514         .260             7.105*
Organization          1.771            1.173         .142             2.277
Paragraph Analysis    2.184             .574         .330            10.488*

df = 1, 92
* = p < .01
R = .63
r² = .40

REGRESSION OF IMPRESSIONISTIC SCALE TOTAL ON SIX
OBJECTIVE MEASURE SUBTESTS

Objective Measure     Unstandardized   Standard      Standardized
Subtests              Coefficient      Error of B    Coefficient     F

Main Idea              .025             .274         .007              .008
Development            .497             .336         .137             2.182
Supporting Detail      .209             .352         .051              .354
Paragraphing          1.05              .319         .310            11.088*
Organization           .421             .248         .153             2.894
Paragraph Analysis     .371             .142         .254             6.08*

df = 1, 92
* = p < .01
R = .68
r² = .46



TABLE 8

REGRESSION OF ANALYTIC SUBSCALES ON ANALOGOUS OBJECTIVE MEASURE SUBTESTS

Analytic           Objective             Unstandardized   Standard      Standardized
Subscale           Subtest               Coefficients     Error of B    Coefficients    F          df      R      r²

I. Focus           Main Idea              .518            .269          .188             3.709*    1,96    ?      .14
                   Main Idea 1           1.051            .375          .273             7.826*    1,96

II. Development    Development            .660            .307          .213             4.619*    1,96    .40    .17
                   Development 1          .953            .333          .203             8.171*    1,96

III. Support       Supporting Detail      .545            .313          .161             3.026*    1,96    .48    .23
                   Supporting Detail 1   1.548            .347          .412            19.096*

IV. Paragraphing   Paragraphing          1.303            .269          .442            23.511*    1,97    .44    .20

V. Organization    Organization           .905            .360          .360            13.516*    1,96    .42    .10
                   Organization 1         .514            .131          .131             1.797     1,96

1 = items within Subtest VI, Paragraph Analysis
* = p < .01



ALTERNATIVE SCORING SYSTEMS FOR PREDICTING
CRITERION GROUP MEMBERSHIP



Lynn Winters 



Center for the Study of Evaluation 
UCLA Graduate School of Education 
Los Angeles, California 90024 



The project presented or reported herein was performed pursuant to a
grant from the National Institute of Education, Department of Education.
However, the opinions expressed herein do not necessarily reflect the
position or policy of the National Institute of Education, and no
official endorsement by the National Institute of Education should be
inferred.






ABSTRACT 

The purpose of this study was to describe the effects of using 
competing scoring systems on classification decisions involving high 
school and college writers. Two questions were of interest: 

1. What are the comparative reliabilities of different scoring 
systems?

2. What are the comparative validities of different systems for 
classifying writers into the appropriate levels of writing 
skill? 







THE EFFECTS OF DIFFERING RESPONSE CRITERIA 
ON THE ASSESSMENT OF WRITING COMPETENCE 



The method selected for judging writing samples ipso facto defines
"good writing." However, decisions about writing competence derived
from one scoring system may not lead to the same decisions about an
examinee when another scoring system is used. The effect of using
alternative methods of judging compositions on the classification of
examinees into a priori defined levels of writing skill has not been
investigated. The purpose of this study was to describe the effects of
using competing scoring systems on classification decisions involving
high school and college writers. Two questions were of interest:

1. What are the comparative reliabilities of different scoring
systems?

2. What are the comparative validities of different systems for
classifying writers into the appropriate levels of writing
skill?

The Validation of Scoring Systems 

Types of Scoring Systems

Scoring systems are generally of three types: impressionistic,
analytic, or frequency counts (Braddock et al., 1963). General
impression marking is the system whereby two or more raters quickly read an
essay and then assign a single score ranking the writing sample in
relation to all other papers being evaluated. The several ratings assigned by
different scorers are averaged to obtain a reliable (i.e., stable)
estimate of a writer's ability. Several sorts of general impression
marking are popular and incorporate rubrics of various levels of
specificity and different score ranges. The one feature shared by
impressionistic systems is the inclusion of rater training prior to
scoring in order to "calibrate" readers in the assignment of marks
(Conlan, 1976).

Analytic scoring systems, in contrast to general impression marking,
require careful scrutiny of the writing samples. While impressionistic
systems are predicated on the assumption that good writing is clearly
recognizable, analytic systems assume that quality writing is
characterized by the inclusion of certain rhetorical elements. One of the most
popular of the analytic rubrics is the ETS Composition Scale, often
called the Diederich Expository Scale. This scale emerged from a study
that investigated factors influencing human judgment of expository
writing samples. A factor analysis of patterns of rater agreement,
combined with a review of rater comments on the essays belonging to each
of five obtained clusters, was used to form the categories for the
Diederich Expository Scale (DES).

The third type of scoring system is the frequency count. Whereas
general impression and analytic scales focus on the quality of an essay,
frequency counts are concerned with the quantity of such linguistic
elements as the number of unique words in an essay, the number of sentences,
or the number of words per independent clause. The frequency counting
method with perhaps the most potential for judging levels of writing
skill is the T-unit. T-units are the shortest grammatically complete
sentences that a passage can be cut into without creating fragments
(Hunt, 1977). Studies investigating the relationship of T-unit length
to some measure of writing quality have tended to support the hypothesis
that a positive relationship between the two exists (Howerton, 1977;
Hunt, 1977; O'Hare, 1973).
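
As an illustration of the frequency-count measure described above, the following is a minimal sketch that computes mean words per T-unit. It assumes the passage has already been segmented into T-units by hand, since the segmentation itself requires grammatical judgment. The function name and the sample passage are hypothetical and are not taken from the study.

```python
def mean_t_unit_length(t_units):
    """Return the mean number of words per T-unit.

    t_units: list of strings, each one T-unit segmented by a human rater
    (segmentation is assumed to be done already).
    """
    if not t_units:
        raise ValueError("at least one T-unit is required")
    word_counts = [len(unit.split()) for unit in t_units]
    return sum(word_counts) / len(word_counts)


# Hypothetical two-T-unit passage, segmented by hand.
sample_t_units = [
    "The rain stopped",
    "and the children ran outside because they had waited all morning",
]
mean_length = mean_t_unit_length(sample_t_units)
print(mean_length)
```

The score is a pure quantity measure: nothing in it reflects how well the ideas are developed or organized.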

A number of studies (Follman & Anderson, 1967; O'Hare, 1973;
Howerton, Jacobson, & Elden, 1977) have found relationships among the
three types of scoring systems. These findings have led researchers to
hypothesize that different scoring systems measure a number of elements
in common (Follman & Anderson, 1967). However, a simultaneous
comparison of impressionistic, analytic, and frequency count systems has not
been done. The different assumptions about writing skill implicit in
each type of rubric suggest that high intercorrelations may not exist.
If such is the case, their interchangeability for judging writing
competence is impugned.

Validation Procedures 

There have been few attempts to validate direct measures of writing
skill (i.e., sets of response criteria) beyond the level of content or
descriptive validity. The content validity of the scoring criteria used
in NAEP and the College Entrance Examination Board's Advanced Placement
Examination was established by "expert judgment" (Lloyd-Jones, 1977).
Neither of these well-established systems has been investigated as to
its ability to predict success on subsequent tasks or classify examinees,
issues in empirical validation.

While few guidelines exist in the literature on writing research
for the empirical validation of scoring systems, techniques for
establishing empirical validity have been described in the context of
criterion-referenced testing (Hambleton et al., 1977; Harris, 1974;
Millman, 1974). Criterion-referenced measurement technology offers two
major strategies for investigating empirical validity: (1) the use of
scores on established tests as a criterion that newly developed test
items should predict, and (2) the use of the performance data of
criterion groups to validate the accuracy of classification decisions to be
made from the test.

The use of criterion groups to investigate the empirical validities
of different scoring criteria provides an a priori classification system
against which the criteria can be tested, i.e., the ability to sort
examinees representing different levels of writing skill into the
correct writing groups. The criterion groups in this study were chosen to
contain low and high performance high school and college writers. These
four groups provided an operational definition of the construct "writing
competence." The four scoring systems validated were selected to
represent the three types of scoring criteria: impressionistic, analytic,
and frequency counting. Two types of analytic scales, the Diederich
Expository Scale and the CSE Analytic Scale, were tested. The Diederich
Scale contains both impressionistic elements and elements tied to the
requirements of exposition. The CSE Analytic Scale was derived from
current composition textbooks and emphasizes the elements of writing
taught in the schools. T-unit analysis was the method selected to
exemplify the frequency counting approach. It was chosen because it was
validated against qualitative measures of writing and was expected to
have elements in common with impressionistic and analytic scales.

Procedures 

Assignment of Papers to Raters 

Two designs were used for assigning student papers to raters, one
for scoring sample papers used to evaluate the effectiveness of rater
training (Pilot Studies) and the second to examine the writing
performance of the four criterion groups across the four scoring systems
(Final Papers). Eighty subjects produced two expository essays on two
test occasions, one week apart. Two sets of four raters were trained in
two scoring systems and graded essays after each training session.

The Pilot Studies were used to ensure that all raters were
assigning essay scores consistent with each other, to investigate topic effects,
and to obtain estimates of generalizability for several research designs
for the final rating of papers. These studies were conducted four
times, once after each rater training session. Twenty-four papers,
chosen to represent all four criterion groups and both topics, were
scored by all four raters. Two analyses of variance were run on the
pilot scores, one to obtain intraclass coefficient alpha and the other
to obtain estimates of variance components for subjects, topics, raters,
and their interactions under a random model, fully crossed design. The
coefficients calculated from these analyses are displayed in Table 1
below.

Table 1

Alpha and Generalizability Coefficients: Pilot Data

                                              Scoring System
Coefficient and Condition                  GI     DES    CSE    T

Alpha: 4 Raters, Topics Fixed              .97    .82    .96    .98
Generalizability: 4 Raters, 2 Topics       .87    .88    .80    .65
Predicted G: 2 Raters, 2 Topics            .85    .79    .78    .47
Predicted G: 2 Raters, 1 Topic             .85    .87    .85    .68






Results of the pilot studies indicated that fairly high reliability
could be obtained under a nested design with two raters reading one
paper per examinee. The final rating design crossed subjects and topics
with raters, but nested raters within topics. Each rater read only 80
papers, one from each subject, 40 of which were Topic 1, and 40 of
Topic 2. Subjects and topics were randomly assigned to raters.



Figure 1

Assignment of Writing Samples

                                       Group
            HSL             HSH             Coll L          Coll H
            n=20            n=20            n=20            n=20

Rater     T1     T2       T1     T2       T1     T2       T1     T2
          n=10   n=10     n=10   n=10     n=10   n=10     n=10   n=10



Subjects 

The subjects of the study were high school and college students
enrolled in composition classes in June 1981. Twenty (20) sets of
writing samples were randomly selected from papers obtained from writing
classes at a suburban, working-class high school and a major urban
university. Classes were chosen to represent two levels of writing
skill, high and low, for both high school and college populations. The
students whose samples were selected became the criterion groups
described below:

1. Low Performance High School Writers (HSL)

Students enrolled in basic composition classes at the eleventh
grade level who do not intend to attend a university.

2. High Performance High School Writers (HSH)

Students enrolled in eleventh grade advanced composition
classes who plan to attend a university.

3. Low Performance College Writers (Coll L)

Students enrolled in remedial college composition at a
university by virtue of having failed a written English placement
examination.

4. High Performance College Writers (Coll H)

Students enrolled in freshman composition at a university
after having passed the written placement examination or a
course in remedial composition.

Writing Tasks 

Subjects were given 50 minutes to produce a 200-word expository 
sample on two test occasions, one week apart. Two parallel topics were 
randomly assigned to subjects and counterbalanced to control for test 
occasion. Dictionaries were not available, although subjects were told 

to revise papers if time allowed. 

Controls

The design did not attempt to control for the instructional history
of the subjects or the effects of test occasion. It did, however,
attempt to control for the following: (1) topic effects, (2) rater
background, and (3) inter-rater disagreement. Table 2 and Figure 2
summarize the steps taken to control sources of variation due to topics
and raters.

Dependent Measures and Rater Training

There were four dependent measures for each subject. These
consisted of each student's score over two topics and two raters for each
of the four scoring systems. As the validity of these measures was


Table 2

A Summary of Controls for Sources of Variation
Due to Topic and Rater

1. Examinee scores were averaged across topics.

2. A stratified random sampling procedure was used so that each rater
received one writing sample from each of the 80 subjects and an
equal number of papers on each topic.

3. Papers were arranged in rater packages so that each rater received
papers in random order. Additionally, rater packets were created
so that there was no systematic pairing of raters.

4. Writing samples were typed and assigned random identification
numbers to obscure examinee name and criterion group status.

5. To minimize reactivity, scoring systems were assigned to training
conditions so that the two analytic scales were separated and
general impression marking was introduced first.

6. Each training condition consisted of four raters, two high school
and two college teachers. All raters learned at least two systems.
Two raters were trained in all four systems.



Figure 2

Assignment of Raters to Training Conditions

            Session 1                     Session 2
            GI & DES                      CSE & T

Rater 1     College remedial teacher      College remedial teacher
Rater 2     College composition           College composition
Rater 3     High school English           High school English
Rater 4     High school English           High school English






directly dependent upon their reliabilities, raters were trained to use
each system before being allowed to score essays. The general procedure
for rater training in each system follows: (1) Raters were given
materials explaining the purpose of the scoring system, copies of the topics,
and practice papers representing samples of each of the criterion groups.
(2) Raters discussed the rubric and defined scoring categories among
themselves. (3) Practice papers were scored and discussed until raters
felt they could agree on the assignment of marks. (4) Raters read 24
sample papers for a pilot study (described in a previous section).
(5) When the alpha coefficient from the pilot data was .80 or better,
raters were deemed "trained" and allowed to score the actual writing
samples.
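
The .80 training criterion in step (5) can be illustrated with a short sketch of coefficient alpha computed across raters. The study derived its alpha from an analysis of variance on the pilot scores. The version below uses the standard variance-based formula instead, which is a substitution made here for brevity. The ratings are hypothetical and the function name is an assumption.

```python
from statistics import pvariance


def coefficient_alpha(scores):
    """Coefficient alpha treating raters as 'items'.

    scores: list of papers, each a list with one rating per rater.
    Uses alpha = k/(k-1) * (1 - sum of rater variances / variance of totals).
    """
    n_raters = len(scores[0])
    rater_variances = [
        pvariance([paper[r] for paper in scores]) for r in range(n_raters)
    ]
    total_variance = pvariance([sum(paper) for paper in scores])
    return (n_raters / (n_raters - 1)) * (1 - sum(rater_variances) / total_variance)


# Hypothetical pilot ratings: 6 papers, each scored 1-6 by 4 raters.
pilot_ratings = [
    [2, 2, 3, 2],
    [4, 4, 4, 5],
    [1, 2, 1, 1],
    [5, 5, 6, 5],
    [3, 3, 3, 4],
    [6, 5, 6, 6],
]
alpha = coefficient_alpha(pilot_ratings)
raters_trained = alpha >= 0.80  # criterion stated in the training procedure
print(round(alpha, 2), raters_trained)
```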

In investigating the classification validities of the scoring
systems, two issues were involved: Which systems discriminated the best
among groups? And which systems classified subjects most accurately?
For each of these issues, the dependent measures differed. The
following outline summarizes the variables of interest for each question:

I. Variables of Interest in Finding the System(s) Which Best
Discriminate Among Groups

A. Independent Variables: Criterion Groups

B. Dependent Variables: Scores Obtained on Systems

1. GI: Score Scale 1-6

2. DES: Score Scale 5-45

3. CSE: Score Scale 6-24

4. T: Score Scale 8-34 words per T-unit

II. Variables of Interest in Finding Scoring Systems Which Best Predict
Criterion Group Status

A. Independent Variables: Scoring Systems as a Treatment

B. Dependent Variables: Criterion Group Membership


Analyses 

Several preliminary analyses were necessary in order to select the
appropriate procedures for analyzing data related to the question of
classification validity. Rater agreement and generalizability
coefficients were calculated. The relationship among scoring systems was
examined with Pearson correlations. Finally, univariate analyses of
variance for each scoring system were run to identify those systems
which distinguished among writing groups (see Tables 3-6).

The results of these analyses indicated that while all systems had
high alpha coefficients, the reliability coefficients were somewhat
lower and not as comparable. T-unit, which had the highest alpha (.99),
could not be investigated with a G coefficient due to unusual subject by
topic interactions. DES had the highest generalizability coefficient
(.85) and an alpha of .80. GI and CSE were comparably reliable with Gs
of .63 and .67 respectively and alphas of .81 and .97. Reliability
dropped from the predicted pilot coefficients. Finally, unusual subject
by topic effects of a high magnitude were found. These could not be
explained because the research design confounded topic with occasion for
the individual examinee and rater with topic.

Intercorrelations among the three systems for which significant
differences among writing groups were found (GI, DES, CSE) at first
appeared quite high. However, when the data were analyzed by criterion
group, the relationship among these systems fluctuated from mild to
weak. T-unit analysis did not measure what is conventionally described
by holistic systems (GI) or analytic criteria (DES, CSE) as good writing.
Its negative relationship to other systems for some of the criterion




Table 3

Data for Estimating Generalizability Coefficients: Final Papers

                                                                    Estimated Variance
Source                               SS          DF      MS         Components

(a) General Impression                                                          G = .63

Subjects                             224.286     57      3.935      .418
Topics                                 3.445      4      --         0.000
Raters within Topics                   3.941      8       .492      .003
Subjects X Topics                     55.304     57       .970      .325
Residual                              36.556    114       .321      [illegible]
  (Subjects by Raters within Topics)

(b) Diederich Expository Scale                                                  G = .85

Subjects                            9649.122     52    185.560    33.473
Topics                               647.993      4    161.998     0.000
Raters within Topics                  76.227      8      9.528      .073
Subjects X Topics                   1202.453     52     23.124     8.852
Residual                             563.693    104      5.420      --
  (Subjects by Raters within Topics)

(c) CSE Analytic Scale                                                          G = .67

Subjects                            2447.302     62     39.473     4.400
Topics                                30.800      4      7.7       0.000
Raters within Topics                  23.092      8      2.887      .025
Subjects X Topics                    545.944     62      8.806     3.773
Residual                             156.376    124      1.261     1.261
  (Subjects by Raters within Topics)

(d) T-Unit Analysis                                                             G = Undef

Subjects                            1370.150     51     26.866     0.000
Topics                               290.958      4     72.740      .002
Raters within Topics                   2.042      8       .255      .001
Subjects X Topics                   1488.059     51     29.178    14.493
Residual                              19.537    102       .192      .192
  (Subjects by Raters within Topics)

* Negative variance components are reported as zero.
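
The G coefficients in Table 3 can be related to the variance components with a short sketch. The report does not state the formula it used. The version below assumes a relative-error generalizability coefficient for a score averaged over 2 topics with 2 raters per topic, which matches the description of the dependent measures, and this form happens to reproduce the reported values. The residual components for GI and DES are taken to equal their residual mean squares (.321 and 5.420), which is a further assumption. The function and variable names are hypothetical.

```python
def g_coefficient(var_subjects, var_subj_by_topic, var_residual,
                  n_topics=2, n_raters_per_topic=2):
    """Relative-error G coefficient for raters nested within topics.

    Error variance = subject-by-topic component averaged over topics
    plus residual component averaged over all topic-rater observations.
    """
    error = (var_subj_by_topic / n_topics
             + var_residual / (n_topics * n_raters_per_topic))
    return var_subjects / (var_subjects + error)


# (subjects, subjects x topics, residual) components from Table 3.
# Residual values for GI and DES are assumed equal to the residual MS.
components = {
    "GI": (0.418, 0.325, 0.321),
    "DES": (33.473, 8.852, 5.420),
    "CSE": (4.400, 3.773, 1.261),
}
g_values = {name: round(g_coefficient(*vals), 2)
            for name, vals in components.items()}
print(g_values)
```

For T-unit analysis the subjects component is reported as zero, so this ratio would be zero, consistent with the table listing G as undefined.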




Table 4

Intercorrelations Among Scoring Systems
for Total Sample (N=80)

System       GI       DES      CSE      T-units

GI
DES          .82*
CSE          .79*     .86*
T-Unit       .00      .06      .00

* Indicates significant correlations at p < .05, df 18.


Intercorrelations Among Scoring Systems
by Criterion Groups (n=20)

                          Low                          High
                 GI       DES      CSE        GI       DES      CSE

High School
  GI
  DES            .64*                         .54*
  CSE            .71*     .79*                .44*     .30
  T              .28      .33      .30        .00      .24      .21

College
  GI
  DES            .53*                         .51*
  CSE            .11      .37                 .70*     .60*
  T             -.07     -.39     -.52       -.31     -.23     -.52

* Indicates significant correlation at p < .05, df 18.






Table 5

Criterion Group Performance Across Scoring Systems:
Means and Standard Deviations

                         HS Low        HS High   Coll Low   Coll High
System                   n=20          n=20      n=20       n=20        Total

GI             Mean       1.97          3.21      4.01       3.33        3.02
(Total 6)      S.D.       [illegible]    .65       .72        .74        1.00

DES            Mean      13.25         25.66     28.00      26.00       23.29
(Total 45)     S.D.       2.95          2.91      3.26       4.09        6.63

CSE            Mean       9.84         15.50     17.07      16.18       14.65
(Total 24)     S.D.       1.57          1.92      2.08       2.83        3.55

T-unit         Mean      14.18         15.07     13.99      14.89       14.52
(Range 10-34)  S.D.       3.45          5.18      1.95       2.05        3.38



Table 6

Analyses of Variance for Each Scoring System:
Criterion Groups as Independent Variables

System   Source      SS           DF    MS           F         Eta²

GI       Between       752.637     3      250.879    36.094    .59
         Within        528.250    76        6.951

DES      Between     42930.338     3    14310.113    80.522    .76
         Within      13506.550    76      177.218

CSE      Between     10280.700     3     3426.090    46.320    .65
         Within       5622.500    76       73.980

T        Between       263.740     3       87.91       .470    .02
         Within      14205.085    76      186.90
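
The Eta² column of Table 6 is the proportion of total variation in scores that lies between criterion groups. A minimal sketch of that calculation follows, using the sums of squares printed for GI, DES, and T-unit. The function and variable names are hypothetical.

```python
def eta_squared(ss_between, ss_within):
    """Proportion of total sum of squares attributable to group differences."""
    return ss_between / (ss_between + ss_within)


# (SS between, SS within) from Table 6.
anova_ss = {
    "GI": (752.637, 528.250),
    "DES": (42930.338, 13506.550),
    "T": (263.740, 14205.085),
}
eta = {name: round(eta_squared(*ss), 2) for name, ss in anova_ss.items()}
print(eta)
```

Read this way, criterion group membership accounts for roughly three quarters of the variation in DES scores but only about 2% of the variation in T-unit scores.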










groups suggested that, in the present study, information gained from
T-unit scores would be of little benefit in predicting expository writing
performance as defined by criterion group membership. It was therefore
not used in later analyses.

The classification validities of the systems were investigated by a
simple ranking procedure and with discriminant analysis. The
discriminant analysis (see Tables 7 and 8) revealed only one significant Linear
Discriminant Function (LDF) for each of two combinations of scoring
systems, GI/DES/CSE and DES/CSE. An examination of the within-group
correlation matrices and the standardized LDFs indicated that DES
possibly contributed most to group separation when it was combined with either
CSE or GI. Both DES and CSE separated group centroids better than GI.

Table 7

Within-Groups Correlation Matrix for Discriminant Analysis

System       GI       DES      CSE

GI
DES          .54
CSE          .48      .55
T            .01      .04     -.04



Table 8

LDF Coefficients for Two Discriminant Analyses:
GI/DES/CSE and DES/CSE

               Eigen    Relative % of Information    Wilks'    Standardized
Analysis       Value    Accounted for by LDF         Lambda    LDF Coeff.

GI/DES/CSE     3.388    96.69                        .204      GI  -0.61
                                                               DES -0.758
                                                               CSE -0.214

DES/CSE        3.374    99.69                        .226      DES -0.793
                                                               CSE -0.232




Discriminant classification revealed a 61% accuracy of group
placement when three systems were used (GI/DES/CSE) and a 59% accuracy with
the two analytic systems. Classification accuracy was greatest for the
High School Low group (95%) and least for the College High group (20%).
Prediction was better for the low writers in each level of school than
the high. Misclassification tended to be from College High to College
Low and College Low to High School High. The High School High group was
misclassified as College High when only the two systems were used. When
three systems were used, misclassifications were almost equally
distributed between the two college groups for the High School High writers.
The most accurate discriminant classification analysis is reproduced in
Table 9 below.

Table 9

Classification Accuracy Using GI, DES, CSE Scoring Systems

                                   Predicted Group Membership
Actual Group          N       HSL      HSH      Coll L    Coll H

High School Low       20      95%       5%       0%        0%
High School High      20       0%      55%      10%       35%
College Low           20       0%      25%      70%        5%
College High          20      10%      20%      45%       25%

Percent of cases correctly classified: 61.25%
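
The overall figure in Table 9 follows directly from the diagonal of the classification matrix. The sketch below recomputes it from the row percentages and the group size of 20. The variable names are hypothetical, and the calculation is ordinary confusion-matrix arithmetic rather than anything specific to the discriminant software used in the study.

```python
# Row percentages from Table 9; columns are predicted HSL, HSH, Coll L, Coll H.
classification_pct = {
    "HSL": [95, 5, 0, 0],
    "HSH": [0, 55, 10, 35],
    "Coll L": [0, 25, 70, 5],
    "Coll H": [10, 20, 45, 25],
}
group_size = 20  # each criterion group contributed 20 subjects

groups = list(classification_pct)
# The diagonal entry of each row is the share of that group placed correctly.
correct_cases = sum(
    classification_pct[group][i] * group_size / 100
    for i, group in enumerate(groups)
)
overall_accuracy = 100 * correct_cases / (group_size * len(groups))
print(correct_cases, overall_accuracy)
```

Because the groups are equal in size, the overall accuracy is simply the mean of the four diagonal percentages.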



The frequency distributions used for examining group classification
for each scoring system separately displayed the same trends as the
discriminant classifications (see Figures 3-6 and Table 10). For all
systems but T, the High School Low group was clearly discriminated from
the other three groups. There was much intermingling among the other
groups, although DES appeared to separate the High School High and
College Low groups better than did the other two (GI/CSE) effective




Table 10

Classification Accuracy for Scoring Systems

                                   Predicted Group Membership
Actual Group          N       HSL      HSH      Coll L    Coll H

(a) General Impression

High School Low       20      81%      19%       0         0
High School High      20      15%      36%      46%        0
College Low           20       0       21%      29%       56%
College High          20       5%      28%      39%       36%

Overall GI Accuracy = 46%

(b) Diederich Expository Scale

High School Low       20      93%       8%       0         0
High School High      20       2%      46%      38%       16%
College Low           20       0       17%      35%       48%
College High          20       7%       3%      49%       37%

Overall DES Accuracy = 53%

(c) CSE Analytic Scale

High School Low       20      88%      13%       0         0
High School High      20       3%      46%      36%       51%
College Low           20       0       23%      36%       51%
College High          20      10%      19%      39%       33%

Overall CSE Accuracy = 51%

(d) T-Unit Analysis

High School Low       20      38%      18%      15%       30%
High School High      20      30%      17%      28%       25%
College Low           20      25%      34%      16%       25%
College High          20       8%      32%      41%       20%

Overall T-unit Accuracy = 24%






scoring systems. With all systems but T, it appears the High School Low 
subjects are assigned the bottom third of the scores with the other 
criterion groups almost normally distributed about the top two-thirds of 
the score rankings and virtually indistinguishable from each other. 

Unexpected findings include the impotence of the T-unit to
discriminate or classify, as well as its non-existent relationship to
qualitative scoring systems. Also unexpected was the wide divergence in types
of writing samples receiving high and low ratings from each system. A
perusal of the actual essays, more than any other analysis, illustrated
the differences among the four systems. There was an unanticipated
reversal in group means for the College Low and High groups. Finally,
the inability of the systems, acting singly or in concert, to clearly
separate all four groups was unanticipated.



Conclusions

Classification Validity

In order for a scoring system to accurately classify examinees into
the correct writing groups, it must be able to distinguish among the
groups. It was found that three of the systems, General Impression,
Diederich Expository Scale, and CSE Analytic Scale, were associated with
differences in group performance. T-unit was not, making it virtually
useless for classification purposes.

When the three "discriminating" scoring systems (GI, DES, CSE) were
used to classify examinees into the proper criterion groups, it was found
that three systems classified more accurately than two. This finding
was predictable in that classification accuracy is often enhanced by the
use of several sources of information. Another explanation for the
advantage of using three scoring systems to classify examinees is tied
to the kinds of scoring systems used in this study. Two analytic and a
general impression system were retained for the discriminant analyses.
Although one analytic system (DES) contributed most to group separation,
the best prediction was obtained with a combination of DES and GI scores.
It may be that accurate assessment of writing must incorporate both
impressionistic and analytic strategies. Support for this interpretation
is gained when it is remembered that the one system that appears as the
strongest variable is the DES, a system combining both general impression
and analytic elements. The CSE scale does not lead to as accurate
classification of group membership as DES. This result may be due, in
part, to the fact that the scale provides no rating category for an
overall impression of the effectiveness of the essay.

These interpretations must be accepted with caution, however, due
to the unusual interconnections of the systems. There are groups by
scoring system differences which may diminish the possibilities that any
one scoring system can be an accurate predictor for all four groups. It
is quite apparent that either holistic or analytic methods are more
effective for distinguishing among writers of several ability levels and
within a restricted age range.

The results of the classification analyses indicate that it is 
fairly easy to identify the lowest ability writers in a group and even, 
perhaps, some of the highest, but most difficult to separate writers of 
medium to high ability. 

High school grouping practices provide a reason for the better
classification accuracies obtained for the high school than college
groups. In high school, students are assigned to English classes on the
basis of previous grades, teacher recommendation and post high school
educational plans. The high school environment is smaller; teachers
often know students well enough to "counsel" them into the appropriate
English classes. On the other hand, the university composition
placement decision is made on the basis of an impersonal, one-shot
examination scored on criteria unknown to students and teachers. This
procedure greatly increases the possibility for mistakes in classifying
examinees.

Classification accuracy is best for the low writing groups in both 
high school and college. It may be that the characteristics of "low"
writing performance are more stable and discernible than those of "high" 
writers. To paraphrase Tolstoy, poor essays are very much alike, but 
good essays are outstanding in their own ways. 

Recommendations 

Several limitations constrain the generalizability of this study's
findings. Scores reported for each system were obtained by raters who
had participated in training. Rater training causes persons with
potentially divergent views to "agree to agree" on both the meaning of the
scoring rubric and the meaning of the scores. Different raters could
agree with each other but disagree with the interpretation of the scoring
rubric and score assignments reported in this study. The norming
phenomenon, though clearly unavoidable, is rarely reported in writing research.
The lesson: scoring rubrics never specify the entire set of criteria
used to judge an essay. There is always that "x" factor which stands
for how a rater interprets the rubric.

A second limitation to the generalizability of results has to do
with the subjects. Participants in this study do not represent a random
sample of high school or college writers. They do represent a sample of
typical writing groups about which decisions must be made for placement
into writing programs.

Finally, the topics used in this study, which appear to have
produced some unusual effects on writing performance, were expository.
There is no evidence that other modes of discourse or even other genres
of expository topics would have produced the same results.

In spite of these limitations, several recommendations, both
methodological and substantive, emerged from the study. Several
procedures used facilitated interpretation of results. The use of a
generalizability study during pilot ratings made it possible to identify
conditions responsible for inconsistencies in score assignment. The G
study allowed various designs for the final study to be considered in
terms of the trade-off between reliability and cost-effectiveness. The
calculation of both rater and G coefficients clarified the amount of
rater disagreement contributing to variation in examinee scores
attributable to error. Rater training provided valuable insights as to how
human factors interact with scoring criteria and essays to produce a
final estimate of writing ability. Finally, a review of representative
essays revealed how scoring systems differed in their definitions of
"good writing."

Substantive results point to testable generalizations about the
comparative validities of the four systems for describing and
classifying the writing performance of high school and college students. General
Impression scoring, a widely used screening procedure which is speedily
trained and scored, may not be the best system for making placement
decisions or for graduation competency. The lack of descriptive power
of its rubric severely restricts the instructional utility of this
system when compared to analytic scoring techniques. Results of this
study indicated that, while not sufficient as a placement tool, GI
scores may be a necessary part of any writing assessment procedure.

The Diederich Expository Scale, while requiring extensive training 
time, appears to be the best system for distinguishing between high and 
low performance writers at both the high school and college levels. Its 
mild correlations with GI and CSE analytic scores indicate that the DES 
may be providing two sorts of information, impressionistic as well as 
analytic. It was found that discrimination and classification accuracy 
were increased when the number of data sources was increased. These 
findings suggest that the complex construct "writing skill" might best 
be assessed by more than one approach. 

The use of the CSE scale to distinguish among examinees at different 
writing skill levels is a highly reliable and promising technique. The 
fact that this scale contributes less to group discrimination than the 
DES is possibly explained by its lack of attention to any of the 
qualitative aspects of writing. A revision of the scale with a category for 
the holistic aspects measured by GI may increase the classification and 
descriptive powers of this system. 

The ability of measures of syntactic maturity to assess writing 
quality is seriously questioned by the results of this study. The lack 
of correlation of T-unit scores with three qualitative systems indicates 
that claims for a relationship between writing quality and syntactic 
maturity, at least for writers within a restricted age range, must be 
seriously re-examined. 
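
The kind of check behind this conclusion can be sketched as follows. The 
sketch assumes that a T-unit score is the mean number of words per T-unit, 
with T-units counted by hand, and that it is compared with a qualitative 
score through a Pearson correlation. Both assumptions, the function names, 
and the numbers in the usage example are illustrative and are not the 
study's data or procedure. 

```python
from math import sqrt


def mean_t_unit_length(word_count, t_unit_count):
    """Words per T-unit for one essay (T-units are assumed hand-counted)."""
    return word_count / t_unit_count


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Made-up (word count, T-unit count) pairs and made-up quality scores.
    essays = [(240, 20), (310, 22), (180, 15), (275, 19)]
    quality = [3, 2, 4, 3]
    lengths = [mean_t_unit_length(w, t) for w, t in essays]
    print(round(pearson_r(lengths, quality), 3))
```

A coefficient near zero across a sample of essays would correspond to the 
lack of correlation reported here. 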






References 



Braddock, R., Lloyd-Jones, R., & Schoer, L. Research in written 
composition. Urbana, Illinois: National Council of Teachers of 
English, 1963. 

Conlan, G. How the essay in the CEEB English test is scored. Princeton, 
N.J.: Educational Testing Service, 1976. 

Follman, J. C., & Anderson, J. A. An investigation of the reliability 
of five procedures for grading English themes. Research in the 
Teaching of English, 1967, 190-200. 

Hambleton, R., Swaminathan, H., Algina, J., & Coulson, D. Criterion- 
referenced testing and measurement: A review of technical issues 
and developments. Review of Educational Research, 1977, 48(1). 

Harris, C. W. Some technical characteristics of mastery tests. In 
C. W. Harris, M. C. Alkin, & W. J. Popham (Eds.), Problems in 
criterion-referenced testing. CSE Monograph Series in Evaluation, 
No. 3. Los Angeles: Center for the Study of Evaluation, University 
of California, 1974. 

Howerton, M. C., Jacobson, M., & Eldon, R. The relationship between 
quantitative and qualitative measures of writing skill. Paper 
presented at the Annual Meeting of the American Educational 
Research Association, New York, April 1977. 

Hunt, K. W. Early blooming and late blooming syntactic structures. In 
C. Cooper & L. Odell (Eds.), Evaluating writing. State University 
of New York at Buffalo, 1977. 

Lloyd-Jones, R. Primary trait scoring. In C. Cooper & L. Odell (Eds.), 
Evaluating writing: Describing, measuring, judging. Urbana, Ill.: 
National Council of Teachers of English, 1977. 

Millman, J. Criterion-referenced measurement. In W. J. Popham (Ed.), 
Evaluation in education: Current applications. Berkeley, Calif.: 
McCutchan, 1974. 

O'Hare, F. Sentence combining: Improving student writing without formal 
grammar instruction. NCTE Research Report No. 15. Urbana, Ill.: 
National Council of Teachers of English, 1973. 






